# Theoretical Questions

* This is the theoretical part of the final project. It includes theoretical questions from various topics covered in the course.
* There are 7 questions among which you need to choose 6, according to the following key:
    + Question 1 is **mandatory**.
    + Choose **one question** from questions 2-3.
    + Question 4 is **mandatory**.
    + Questions 5-6 are **mandatory**.
    + Question 7 is **mandatory**.
* Question 1 is worth 15 points, whereas each of the other questions is worth 7 points.
* All in all, the maximal grade for this part is 15+7*5=50 points.
* **You should answer the questions on your own. We will check for plagiarism.**
* If you need to add external images (such as graphs) to this notebook, please put them inside the 'imgs' folder. DO NOT put a reference to an external link.
* Good luck!

## Part 1: General understanding of the course material

### Question 1

1.  Relate the number of parameters in a neural network to the over-fitting phenomenon (*).
    Relate this to the design of convolutional neural networks, and explain why CNNs are a plausible choice for a hypothesis class for visual classification tasks.

    (*) In the context of classical under-fitting/over-fitting in machine learning models.

#### Answer:
More parameters in a neural network mean that the learned network can represent more complex functions and fit the training data more closely. <br>
A more complex model is more prone to over-fitting, since it can "memorize" relations that are specific to the training set and do not generalize. <br>
In a convolutional neural network the number of parameters depends on the number of convolutional and fully connected layers, the number of channels in each layer, and the kernel sizes. <br>
Activation and pooling layers add no parameters, and stride and padding change only the feature-map sizes (and through them the size of any fully connected layer that follows). Increasing the number of parameters may therefore lead to over-fitting. <br>
On the other hand, a CNN reduces over-fitting by using far fewer parameters than a fully connected network: each filter has only kernel_size*kernel_size weights per input channel, shared across all spatial positions, instead of one weight per pixel. <br>
In addition, because the same kernel is applied over the entire image, a CNN detects a pattern regardless of its position in the image. This matches the structure of visual data, which makes CNNs a plausible hypothesis class for visual classification.
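
To make the parameter saving concrete, here is a small sketch in plain Python that counts the weights and biases of a fully connected layer and of a convolutional layer. The input size (3x32x32), the 64 outputs and the 3x3 kernel are illustrative assumptions, not values from the course.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel.

    The count does not depend on the image height or width.
    """
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


# Hypothetical example: a 3x32x32 image mapped to 64 units / 64 channels.
fc = linear_params(3 * 32 * 32, 64)   # 196672 parameters
conv = conv2d_params(3, 64, 3)        # 1792 parameters
print(fc, conv)
```

The convolutional layer needs about a hundred times fewer parameters here, and the gap grows with the image size because only the fully connected count depends on it.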


2. Consider the linear classifier model with hand-crafted features:
    $$f_{w,b}(x) = w^T \psi(x) + b$$
    where $x \in \mathbb{R}^2$, $\psi$ is a non-learnable feature extractor and assume that the classification is done by $sign(f_{w,b}(x))$. Let $\psi$ be the following feature extractor $\psi(x)=x^TQx$ where $Q \in \mathbb{R}^{2 \times 2}$ is a non-learnable positive definite matrix. Describe a distribution of the data which the model is able to approximate, but the simple linear model fails to approximate (hint: first, try to describe the decision boundary of the above classifier).

#### Answer:
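The decision boundary is the set of points where $$f_{w,b}(x) = 0$$ because the classification is done by $sign(f_{w,b}(x))$.
Since $\psi(x)=x^TQx$ is a scalar, $w$ is a scalar as well, so:  $$ f_{w,b}(x) = w \cdot \psi(x) + b = w \cdot x^TQx + b = 0 \iff x^TQx = -\frac{b}{w}$$ <br>
This is a second-degree equation in $x$. Because $Q$ is positive definite, the decision boundary is an ellipse centered at the origin (when $-b/w > 0$), and the model assigns one label inside the ellipse and the other label outside it.
Thus the model can approximate a distribution in which one class is concentrated around the origin and the other class surrounds it, for example points with $x^TQx < 1$ versus points with $x^TQx > 1$.
A simple linear model, whose decision boundary is a straight line, fails on such data because no line separates an inner region from the region that surrounds it.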
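
As an illustration, the following plain-Python sketch evaluates the classifier on a few points. The matrix `Q` and the values of `w` and `b` are arbitrary choices made for this example (any positive definite `Q` with `-b/w > 0` would do).

```python
def quad_feature(x, Q):
    """psi(x) = x^T Q x for a 2-D point x and a 2x2 matrix Q."""
    x1, x2 = x
    return Q[0][0] * x1 * x1 + (Q[0][1] + Q[1][0]) * x1 * x2 + Q[1][1] * x2 * x2


def classify(x, Q, w, b):
    """sign(w * psi(x) + b), returned as +1 or -1."""
    return 1 if w * quad_feature(x, Q) + b > 0 else -1


Q = [[2.0, 0.0], [0.0, 1.0]]   # positive definite (illustrative choice)
w, b = -1.0, 1.0               # decision boundary: x^T Q x = 1

inside = [(0.0, 0.0), (0.3, 0.3), (-0.3, -0.3)]
outside = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
labels_inside = [classify(p, Q, w, b) for p in inside]     # all +1
labels_outside = [classify(p, Q, w, b) for p in outside]   # all -1
```

A linear classifier $u^Tx + c$ cannot reproduce these labels: labeling both $(2,0)$ and $(-2,0)$ as negative forces $c < 0$ (the two scores sum to $2c$), which makes the origin negative as well.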


3. Assume that we would like to train a Neural Network for classifying images into $C$ classes. Assume that the architecture can be stored in the memory as a computational graph with $N$ nodes where the output is the logits (namely, before applying softmax) for the current batch ($f_w: B \times Ch \times H \times W \rightarrow B \times C$). Assume that the computational graph operates on *tensor* values.
    * Implement the CE loss assuming that the labels $y$ are hard labels given in a LongTensor (as usual). **Use Torch's log_softmax and index_select functions** and implement it with as few operations as possible.

#### Answer:
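$$ \mathrm{CE\_LOSS}(x, y) =  -1 \cdot \sum y \cdot \log(\mathrm{softmax}(\mathrm{model}(x))) = -1 \cdot \sum y \cdot \mathrm{y\_probs} $$ <br>
where $y$ is the one-hot encoding of the labels and y_probs holds the log-probabilities.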

In [1]:
from torch.nn.functional import log_softmax
from torch import index_select
import torch

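# In order to run this notebook successfully we wrapped all code in functions.
# Arguments and return values are arbitrary.

# Input:  model, x, y.
# Output: the loss on the current batch.
def wrapper(model, x, y):
    logits = model(x)                             # shape (B, C)
    y_probs = log_softmax(input=logits, dim=1)    # log-probabilities, shape (B, C)
    # index_select along dim=1 would return a (B, B) matrix (every sample's label
    # column for every row), so we flatten to (B*C,) and select entry i*C + y[i],
    # which is the log-probability of the true class of sample i.
    B, C = y_probs.shape
    flat_index = torch.arange(B, device=y.device) * C + y
    loss = -1 * torch.sum(index_select(input=y_probs.reshape(-1), dim=0, index=flat_index))
    return loss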

* Using the model's function as a black box, draw the computational graph (treating both log_softmax and index_select as atomic operations). How many nodes are there in the computational graph?

#### Answer:
There will be 7 nodes or 8 if we consider the -1 constant for the sum:
<center><img src="imgs/ce_loss.png" /></center>


* Now, instead of using hard labels, assume that the labels represent some probability distribution over the $C$ classes. How would the gradient computation be affected? Analyze the growth in the computational graph, memory and computation.

#### Answer:
Now the computational graph would look like this:
<center><img src="imgs/ce_loss2.png" /></center>

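This means we must add another step to the computation: the dot product between y and y_log_prob. Since y is no longer one-hot, all $C$ probabilities of every sample enter the loss (with one-hot labels only the log-probability of the true class was needed). In the backward pass the gradient flows through all $C$ entries of each sample rather than through a single selected one.
As a result, both the memory and the computation increase: the labels take $B \times C$ values instead of $B$, and the selection is replaced by $B \times C$ multiplications and additions.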
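
The difference between the two cases can be sketched in plain Python (no Torch) for a single sample. The three-class logits and the targets below are made-up values for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def ce_hard(logits, label):
    """Hard label: a single lookup of the true-class log-probability."""
    return -log_softmax(logits)[label]


def ce_soft(logits, target):
    """Soft label: a dot product over all C log-probabilities."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


logits = [2.0, 0.5, -1.0]
one_hot = [0.0, 1.0, 0.0]      # hard label 1 written as a distribution
smooth = [0.1, 0.8, 0.1]       # a genuinely soft label

hard_loss = ce_hard(logits, 1)
same_loss = ce_soft(logits, one_hot)   # equals hard_loss
soft_loss = ce_soft(logits, smooth)
```

With a one-hot target the two functions agree, which shows that the hard-label case is the special case in which all but one of the $C$ products vanish.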


* Apply the same analysis in the case that we would like to double the batch size. How should we change the learning rate of the optimizer?

#### Answer:
If we double the batch size, the gradient computation stays (pretty much) the same, but the gradient estimate becomes more accurate (lower variance) since it is computed from more samples. We also use more GPU memory at a given time, since the inputs and the intermediate activations of the current batch must be stored for the backward pass, and these grow linearly with the batch size. The computational graph keeps the same number of nodes, but each operation processes larger tensors (e.g. a sum over 16 values becomes a sum over 32 values). Finally, because the loss above sums over the samples in the batch, doubling the batch size roughly doubles the gradient magnitude, so the learning rate should be decreased (roughly halved) to keep the step size about the same as before. If the loss were averaged over the batch instead, the gradient magnitude would stay roughly the same, and a common heuristic is to increase the learning rate with the batch size.
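
The dependence of the gradient magnitude on the loss reduction can be checked with a toy example. This is a minimal sketch for a scalar linear model with a squared loss; the data values are made up, and the doubled batch simply repeats the same samples (an idealization of "more data from the same distribution").

```python
def grad_sum_loss(w, xs, ys):
    """d/dw of sum_i (w * x_i - y_i)^2 for a scalar linear model."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def grad_mean_loss(w, xs, ys):
    """d/dw of the same loss averaged over the batch."""
    return grad_sum_loss(w, xs, ys) / len(xs)


xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 9.0], 0.5

g_small = grad_sum_loss(w, xs, ys)             # batch of 4
g_big = grad_sum_loss(w, xs + xs, ys + ys)     # batch of 8
m_small = grad_mean_loss(w, xs, ys)
m_big = grad_mean_loss(w, xs + xs, ys + ys)
```

With the summed loss the gradient doubles, so halving the learning rate keeps the step unchanged; with the averaged loss the gradient is the same for both batch sizes.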

## Part 2: Optimization & Automatic Differentiation

### Question 2: resolving gradient conflicts in multi-task learning

Assume that you want to train a model to perform two tasks: task 1 and task 2.
For each such task $i$ you have an already implemented function *loss\_i = forward_and_compute_loss_i(model,inputs)* such that given the model and the inputs it computes the loss w.r.t task $i$ (assume that the computational graph is properly constructed). We would like to train our model using SGD to succeed in both tasks as follows: in each training iteration (batch) -
* Let $g_i$ be the gradient w.r.t the $i$-th task.
* If $g_1 \cdot g_2 < 0$:
    + Pick a task $i$ at random.
    + Apply GD w.r.t only that task.
* Otherwise:
    + Apply GD w.r.t both tasks (namely $\mathcal{L}_1 + \mathcal{L}_2$).

Note that in the above formulation the gradient is thought of as a concatenation of the gradients w.r.t. all of the model's parameters, and $g_1 \cdot g_2$ stands for a dot product.

What parts should be modified to implement the above? Is it the optimizer, the training loop or both? Implement the above algorithm in a code cell/s below.

#### Answer:

In [4]:
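import torch
import torch.optim as optim
import random

def wrapper_2(Model, num_epochs, data_loader, forward_and_compute_loss_1, forward_and_compute_loss_2):

    model = Model()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    # Assumes every trainable parameter takes part in both losses.
    params = [p for p in model.parameters() if p.requires_grad]

    for epoch in range(num_epochs):
        for batch in data_loader:
            # Gradient of task 1; cloned because zero_grad() may reuse the buffers.
            optimizer.zero_grad()
            loss_1 = forward_and_compute_loss_1(model, batch)
            loss_1.backward()
            grads_1 = [p.grad.detach().clone() for p in params]

            # Gradient of task 2.
            optimizer.zero_grad()
            loss_2 = forward_and_compute_loss_2(model, batch)
            loss_2.backward()
            grads_2 = [p.grad.detach().clone() for p in params]

            # g_1 . g_2 over the concatenation of all parameter gradients.
            grad_1 = torch.cat([g.view(-1) for g in grads_1])
            grad_2 = torch.cat([g.view(-1) for g in grads_2])
            dot_product = torch.dot(grad_1, grad_2)

            if dot_product < 0:
                # Conflict: follow a single task chosen at random.
                final_grads = random.choice([grads_1, grads_2])
            else:
                # No conflict: the gradient of L_1 + L_2 is g_1 + g_2.
                final_grads = [g1 + g2 for g1, g2 in zip(grads_1, grads_2)]

            # Reuse the stored gradients instead of a second backward pass,
            # which would fail because the graphs were already freed.
            for p, g in zip(params, final_grads):
                p.grad = g
            optimizer.step()

    return 0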

The primary component that needs modification is the training loop. The training loop has to be adapted to handle the complexities of multiple tasks, including computing the losses and gradients for each task, calculating the dot product of the gradients, and deciding which tasks to update based on that dot product.

The optimizer, on the other hand, doesn't need to be modified. We can use a standard optimizer like SGD. It's the training loop that dictates how this optimizer is used, based on the specific conditions we set, such as the dot product of the gradients for the two tasks.

So, in summary, it's primarily the training loop that needs to be customized to implement our multi-task training algorithm, while the optimizer can remain the same.
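
The decision rule itself is independent of Torch and can be sketched on plain lists. The function name `combine_gradients` and the example vectors are hypothetical and used only for illustration.

```python
import random


def combine_gradients(g1, g2, rng=random):
    """Return the update direction for two task gradients.

    If the gradients conflict (negative dot product), pick one task at
    random; otherwise return the gradient of L_1 + L_2, i.e. g1 + g2.
    """
    dot = sum(a * b for a, b in zip(g1, g2))
    if dot < 0:
        return list(rng.choice([g1, g2]))
    return [a + b for a, b in zip(g1, g2)]


agree = combine_gradients([1.0, 0.0], [0.5, 1.0])       # dot = 0.5, gradients are summed
conflict = combine_gradients([1.0, 0.0], [-1.0, 0.5])   # dot = -1.0, one task is chosen
```

Because the rule only chooses which gradient is handed to the optimizer, it belongs in the training loop, and a standard SGD step can be applied to the result.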

### Question 3: manual automatic differentiation

Consider the following two-input two-output function:
$$ f(x,y) = (x^2\sin(xy+\frac{\pi}{2}), x^2\ln(1+xy)) $$
* Draw a computational graph for the above function. Assume that the unary atomic units are squaring, taking square root, $\exp,\ln$, basic trigonometric functions and the binary atomic units are addition and multiplication. You would have to use constant nodes.
* Calculate manually the forward pass.
* Calculate manually the derivative of all outputs w.r.t all inputs using a forward mode AD.
* Calculate manually the derivative of all outputs w.r.t all inputs using a backward mode AD.

## Part 3: Sequential Models

### Question 4: RNNs vs Transformers in the real life

In each one of the following scenarios decide whether to use RNN based model or a transformer based model. Justify your choice.
1. You are running a start-up in the area of automatic summarization of academic papers. The inference of the model is done on the server side, and it is very important for it to be fast.
2. You need to design a mobile application that gathers a small amount of data from a few apps every second and then uses a NN to possibly generate an alert given the information in the current second and the information from the past minute.
3. You have a prediction task over fixed length sequences on which you know the following properties:
    + In each sequence there are only few tokens that the model should attend to.
    + Most of the information needed for generating a reliable prediction is located at the beginning of the sequence.
    + There is no restriction on the computational resources.

#### Answer:
1: Transformer model

For a start-up that automatically summarizes academic papers, a Transformer-based model is the most suitable. Transformers are better at handling long sequences and capturing long-range dependencies, which is essential for summarizing academic papers effectively. Much of the early content of a paper (such as the research question, or an experiment description presented at the beginning) is needed in later parts, and transformers handle such long-range dependencies better. Additionally, since inference runs on the server side, computational resources are less likely to be a constraint than in a mobile application. The parallelizable nature of transformers, which process all input tokens at once rather than one after another, also makes inference faster, which is the most important requirement here.

2: RNN model

For a mobile application that gathers a small amount of data from a few apps every second and then uses a neural network to possibly generate an alert, an RNN model is the most suitable. The main reason is the limited computational resources available on mobile phones. RNNs generally require less computation and memory than transformers, since each new input only updates a fixed-size hidden state. An RNN can handle this task well because it involves real-time analysis of a small amount of sequential data (at most the past minute), so long-range relations do not need to carry much weight in the alert decision.

3: Transformer model

In a prediction task over fixed-length sequences where only a few tokens are important for generating a reliable prediction, a Transformer model is the most suitable. Most of the information needed for a reliable prediction is located at the beginning of the sequence; a transformer can attend to these tokens directly and give them higher weight than other tokens, whereas an RNN must carry that information through its hidden state over the whole sequence and may forget it. In addition, in each sequence there are only a few tokens that the model should attend to, which the attention mechanism handles naturally. Moreover, there is no restriction on computational resources, so the RNN's efficiency is not an advantage over the more complex and generally more flexible transformer.

## Part 4: Generative modeling

### Question 5: VAEs and GANS

Suggest a method for combining VAEs and GANs. Focus on the different components of the model and how to train them jointly (the objectives). Which drawbacks of these models the combined model may overcome? which not?

#### Answer:

**Structure of the combined model:**

The combined VAE with GAN model integrates the components of both Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). It consists of an encoder, a decoder, and a discriminator. The encoder maps input data samples to a latent space, producing parameters for a latent variable distribution. The decoder, acting as the generator in this combined framework, takes points from the latent space and reconstructs data samples. The discriminator's role is to differentiate between genuine data samples and those generated by the decoder.

Training the combined VAE with GAN model involves optimizing multiple objectives concurrently. The encoder and decoder are trained using the VAE's reconstruction loss and the KL Divergence. The decoder also aims to act as the generator part of the GAN, and generate samples that try and fool the discriminator, similar to the generator's objective in traditional GANs. The discriminator is trained to correctly classify real samples and distinguish them from the fake samples produced by the decoder. The joint loss function for the encoder and decoder can be represented as:
$$ \text{Loss}_{\text{combined}} = \text{Reconstruction Loss} + \lambda \times \text{KL Divergence} + \gamma \times \text{Generator Loss} $$
Where:
- Reconstruction Loss: Measures how well the decoder can recreate the original input after encoding and decoding.
- KL Divergence: Penalizes the encoder if the latent variable distribution deviates from a standard normal distribution.
- Generator Loss: Is small when the discriminator classifies the decoder's samples as real (for example $-\log D(\text{Decoder}(z))$, where $D(\cdot)$ denotes the discriminator's estimated probability that a sample is real), so minimizing it encourages the decoder to fool the discriminator.
- $\lambda$ and $\gamma$ : Hyperparameters that balance the contribution of each component to the overall loss.
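
The two objectives can be sketched with scalar stand-ins for the loss terms. This is only an illustration under stated assumptions: the adversarial term is written here as a quantity to minimize, $-\log D(\cdot)$, and added with weight `gamma`; the function names and the default weights are hypothetical.

```python
import math


def encoder_decoder_loss(recon_loss, kl_div, d_fake_prob, lam=1.0, gamma=1.0):
    """Joint encoder/decoder objective (to be minimized).

    d_fake_prob is the discriminator's probability that a decoded
    sample is real.
    """
    generator_loss = -math.log(d_fake_prob)   # small when the discriminator is fooled
    return recon_loss + lam * kl_div + gamma * generator_loss


def discriminator_loss(d_real_prob, d_fake_prob):
    """Binary cross-entropy: real samples -> 1, decoded samples -> 0."""
    return -math.log(d_real_prob) - math.log(1.0 - d_fake_prob)


# The decoder is rewarded when the discriminator is fooled ...
fooled = encoder_decoder_loss(1.0, 0.5, d_fake_prob=0.9)
caught = encoder_decoder_loss(1.0, 0.5, d_fake_prob=0.1)
# ... while the discriminator is rewarded for telling the two apart.
sharp = discriminator_loss(d_real_prob=0.9, d_fake_prob=0.1)
unsure = discriminator_loss(d_real_prob=0.5, d_fake_prob=0.5)
```

In training, the two losses would be minimized in alternation: the discriminator loss updates only the discriminator, and the joint loss updates only the encoder and decoder.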

**Drawbacks overcome by the combined model:**

Both VAEs and GANs have their own challenges. VAEs ensure a structured latent space, but they often produce blurrier generated images than GANs. GANs, on the other hand, are known for generating high-quality, sharp images but can suffer from training instability and mode collapse. By combining VAEs and GANs, the model can leverage the strengths of both architectures. The structured latent space of VAEs can aid in the stable training of GANs, while the GAN component can enhance the sharpness and quality of the samples produced by the VAE.

**Drawbacks the combined model may not overcome:**

Despite the advantages of the combined VAE with GAN approach, some challenges remain. The combined model becomes more complex, introducing additional hyperparameters like $ \lambda $ and $ \gamma $ that need careful tuning. This can make the training process more sensitive to hyperparameter values. Moreover, while the combined model can handle issues like mode collapse better to an extent, it is not entirely immune to them. Balancing the objectives of reconstruction, KL divergence, and generation might lead to none of them being optimized perfectly. The combined model also requires more computational resources and time to train due to the added complexity.


### Question 6: Diffusion Models

Show that $q(x_{t-1}|x_t,x_0)$ is tractable and is given by $\mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta_t}I)$ where the terms for $\tilde{\mu}(x_t,x_0)$ and $\tilde{\beta_t}$ are given in the last tutorial. Do so by explicitly computing the PDF.

#### Answer
From the last tutorial:
$$ \tilde{\mu}(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t$$
and:
$$\tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$

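We use the forward process with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ (the standard DDPM definitions, assumed here to match the tutorial):
$$ q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1},\beta_t I), \qquad q(x_t|x_0) = \mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}x_0,(1-\bar{\alpha}_t)I) $$

By Bayes' rule and the Markov property of the forward process ($q(x_t|x_{t-1},x_0) = q(x_t|x_{t-1})$):
$$ q(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1})\,q(x_{t-1}|x_0)}{q(x_t|x_0)} $$

All three factors are Gaussian PDFs, and the denominator does not depend on $x_{t-1}$. Keeping only the terms that depend on $x_{t-1}$:
$$ q(x_{t-1}|x_t,x_0) \propto \exp\left(-\frac{1}{2}\left[\frac{\|x_t-\sqrt{\alpha_t}x_{t-1}\|^2}{\beta_t} + \frac{\|x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}x_0\|^2}{1-\bar{\alpha}_{t-1}}\right]\right) $$

$$ \propto \exp\left(-\frac{1}{2}\left[\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)\|x_{t-1}\|^2 - 2\left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right)^T x_{t-1}\right]\right) $$

The exponent is quadratic in $x_{t-1}$, so the distribution is Gaussian. Its variance is the inverse of the coefficient of $\|x_{t-1}\|^2$:
$$ \frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}} = \frac{\alpha_t(1-\bar{\alpha}_{t-1})+\beta_t}{\beta_t(1-\bar{\alpha}_{t-1})} = \frac{1-\bar{\alpha}_t}{\beta_t(1-\bar{\alpha}_{t-1})} = \frac{1}{\tilde{\beta}_t} $$
where we used $\alpha_t\bar{\alpha}_{t-1} = \bar{\alpha}_t$ and $\alpha_t + \beta_t = 1$. Its mean is the variance times the coefficient of the linear term:
$$ \tilde{\beta}_t\left(\frac{\sqrt{\alpha_t}}{\beta_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 = \tilde{\mu}(x_t,x_0) $$

Completing the square and normalizing therefore gives, for $x_{t-1} \in \mathbb{R}^d$:
$$ q(x_{t-1}|x_t,x_0) = \frac{1}{(2\pi\tilde{\beta}_t)^{d/2}} e^{-\frac{\|x_{t-1}-\tilde{\mu}(x_t,x_0)\|^2}{2\tilde{\beta}_t}} $$

This is a Gaussian PDF whose mean and variance are closed-form functions of $x_t$, $x_0$ and the noise schedule, so $ q(x_{t-1}|x_t,x_0) $ is tractable and is given by $ \mathcal{N}(x_{t-1};\tilde{\mu}(x_t,x_0),\tilde{\beta}_t I) $.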
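
As a numerical sanity check, the closed forms of $\tilde{\mu}$ and $\tilde{\beta}_t$ given above can be compared with the mean and variance obtained by completing the square in $q(x_t|x_{t-1})\,q(x_{t-1}|x_0)$. The sketch below works in one dimension and assumes the standard DDPM forward process $q(x_t|x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}x_{t-1}, \beta_t)$ with $\alpha_t = 1-\beta_t$; the numeric inputs are arbitrary.

```python
import math


def posterior_from_bayes(x_t, x_0, alpha_t, abar_prev):
    """Mean and variance from completing the square (1-D case)."""
    beta_t = 1.0 - alpha_t
    precision = alpha_t / beta_t + 1.0 / (1.0 - abar_prev)
    linear = (math.sqrt(alpha_t) / beta_t) * x_t \
        + (math.sqrt(abar_prev) / (1.0 - abar_prev)) * x_0
    return linear / precision, 1.0 / precision


def posterior_closed_form(x_t, x_0, alpha_t, abar_prev):
    """Mean and variance from the stated mu-tilde and beta-tilde."""
    beta_t = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    mean = (math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x_0 \
        + (math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, var


# Arbitrary test values: alpha_t = 0.98, alpha-bar_{t-1} = 0.6.
mean_a, var_a = posterior_from_bayes(0.7, -1.2, 0.98, 0.6)
mean_b, var_b = posterior_closed_form(0.7, -1.2, 0.98, 0.6)
```

The two pairs agree up to floating-point error, which is what the algebraic derivation predicts.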

## Part 5: Training Methods

### Question 7: Batch Normalization and Dropout

For both BatchNorm and Dropout analyze the following:
1. How to use them during the training phase (both in forward pass and backward pass)?
2. How differently they behave in the inference phase? How to distinguish these operation modes in code?
3. Assume you would like to perform multi-GPU training (*) to train your model. What should be done in order for BatchNorm and dropout to work properly? assume that each process holds its own copy of the model and that the processes can share information with each other.

(*): In multi-GPU training each GPU is associated with its own process that holds an independent copy of the model. In each training iteration a (large) batch is split among these processes (GPUs), which compute the gradients of the loss w.r.t. the relevant split of the data. Afterwards, the gradients from each process are shared and averaged so that the GD step takes the correct gradient into account and the model copies stay synchronized. Note that the processes are blocked between training iterations.

#### Answer:

**BatchNorm:**

In the training phase of BatchNorm, the activations of each layer are normalized over the mini-batch during the forward pass. For each feature we subtract the batch mean and divide by the batch standard deviation (with a small epsilon added for numerical stability), and then scale and shift the result with the learnable parameters gamma and beta. We also keep moving averages of the mean and variance of each feature for later use during inference. The output is then passed on to the rest of the network (typically the activation function). In the backward pass, gradients are computed w.r.t. the input activations, gamma and beta; since the batch mean and variance are themselves functions of the inputs, the gradient also flows through them.

In the inference phase, the mean and variance of the mini-batch are not computed. Instead, the moving averages recorded during training are used to normalize the activations. This makes the output for each sample deterministic and independent of the other samples in the batch. The normalized values are then scaled and shifted using the learned gamma and beta. In PyTorch the two modes are selected with `model.train()` and `model.eval()`, which set the `training` flag of every module.
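
The two operation modes can be sketched in plain Python for a single feature. The class name `BatchNorm1dSketch` is hypothetical, and the sketch simplifies the real layer (one feature, biased variance for the running average, no backward pass).

```python
class BatchNorm1dSketch:
    """Batch norm over one feature (a list of floats), for illustration only."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0                  # learnable scale and shift
        self.running_mean, self.running_var = 0.0, 1.0    # used at inference
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # Exponential moving averages kept for inference.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        return [self.gamma * (x - mean) / (var + self.eps) ** 0.5 + self.beta for x in xs]


bn = BatchNorm1dSketch()
train_out = bn([1.0, 2.0, 3.0, 4.0])   # batch statistics, running averages updated
bn.training = False
eval_out = bn([1.0, 2.0, 3.0, 4.0])    # running averages, nothing updated
```

The same input gives different outputs in the two modes, which is why forgetting to switch modes before evaluation is a common source of errors.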

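With multi-GPU training, each GPU sees only its own split of the batch, so by default each BatchNorm layer computes its statistics on that split alone. For BatchNorm to behave as it would on the full batch, the statistics must be synchronized inside the forward pass of each BatchNorm layer: every process computes its local sums, the processes exchange them, and each one normalizes with the resulting global mean and variance. Note that the global variance is not the plain average of the per-GPU variances; it is obtained from the combined counts, sums and sums of squares. The backward pass needs a matching exchange, because the gradient flows through the shared statistics. In PyTorch this is what `torch.nn.SyncBatchNorm` does. Gamma and beta need no special treatment: like all other parameters, they stay identical across the copies because the gradients are averaged before every update. With synchronized statistics, the running averages are also updated identically on every process.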
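
A minimal sketch of how per-process statistics can be combined into full-batch statistics is shown below. The helper names and the two shards are hypothetical; in a real setup the three sums would be exchanged with an all-reduce.

```python
def local_stats(xs):
    """Per-process count, sum and sum of squares for one feature."""
    return len(xs), sum(xs), sum(x * x for x in xs)


def global_mean_var(stats):
    """Combine per-process statistics into the full-batch mean and variance."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    return mean, total_sq / n - mean ** 2


shard_a, shard_b = [1.0, 2.0], [3.0, 4.0, 5.0, 6.0]   # splits held by two GPUs
mean, var = global_mean_var([local_stats(shard_a), local_stats(shard_b)])

# Naive alternative: average the per-shard variances (not the global variance).
naive_var = (0.25 + 1.25) / 2
```

The combined statistics equal those of the full batch `[1, ..., 6]`, while the naive average of the shard variances is much smaller because it ignores the difference between the shard means.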



---

**Dropout:**

In the dropout training phase, during the forward pass each activation is set to zero independently with probability p (the dropout rate hyperparameter), in order to reduce over-fitting. In the common "inverted dropout" implementation (used by PyTorch), the surviving activations are scaled by 1/(1-p) so that the expected value of each activation is unchanged. In the backward pass the same mask is applied to the incoming gradient: a unit that was dropped receives zero gradient, and a unit that was kept receives the gradient scaled by the same 1/(1-p) factor.

In the inference phase of dropout all neurons are active, so that inference uses all the learned weights, and no randomness is applied. With inverted dropout no further scaling is needed. In the original formulation, where the training activations are not rescaled, the weights (or activations) are instead multiplied at inference by the keep probability 1-p. As with BatchNorm, in PyTorch the mode is selected with `model.train()` and `model.eval()`.
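
The forward pass, the backward pass and the inference behavior of inverted dropout can be sketched in plain Python. The function names are hypothetical, and the seed and the dropout rate are arbitrary.

```python
import random


def dropout_forward(xs, p, training, rng):
    """Inverted dropout: returns the outputs and the mask that was used."""
    if not training or p == 0.0:
        return list(xs), [1.0] * len(xs)
    keep = 1.0 - p
    # Kept units are scaled by 1/keep, dropped units are zeroed.
    mask = [(1.0 / keep) if rng.random() < keep else 0.0 for _ in xs]
    return [x * m for x, m in zip(xs, mask)], mask


def dropout_backward(grad_out, mask):
    """The gradient passes through the same mask used in the forward pass."""
    return [g * m for g, m in zip(grad_out, mask)]


rng = random.Random(0)
xs = [1.0] * 8
out, mask = dropout_forward(xs, 0.5, True, rng)     # training mode
grads = dropout_backward([1.0] * 8, mask)           # zero where a unit was dropped
eval_out, _ = dropout_forward(xs, 0.5, False, rng)  # inference: identity
```

Dropped units contribute neither to the output nor to the gradient, and in inference mode the layer is the identity.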

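With multi-GPU training, dropout needs no synchronization: it has no learnable parameters and no running statistics, and each mask is applied to a single sample. What matters is that the masks on different GPUs are independent, so that the large batch behaves as if a single process had drawn all the masks. If every process seeds its random number generator with the same value, the processes draw identical masks for their splits; giving each process a different seed (for example the base seed plus the process rank) avoids this.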
