Pull requests are welcome :)
This is a list of questions given by my teacher Nwoye Chineduinnocent to prepare for the deep learning exam of the deep learning course at Telecom Strasbourg.
I think these questions give a good overview of deep learning, so my goal is to fill this list with high-quality answers. No bs: every word matters. Keep the answers concise.
The goal is not to have this be understandable by a total beginner, but to be a good reference for someone who already knows the basics of deep learning.
You can find questions without answers in the file questions.md
- What is artificial intelligence?
A branch of computer science that aims to create programs that try to solve problems using human-like intelligence.
- What is an artificial neural network?
It is a computational model inspired by how the human brain works. It has proven to be effective at solving problems in many domains.
- How does ANN mimic the human brain?
It learns to produce the correct output for a given input based on a feedback loop: it adjusts its weights and biases over a large amount of data by comparing its output to the expected output.
- What is machine learning?
Machine learning is an artificial intelligence technique where the machine learns patterns by ingesting large amounts of data, rather than being explicitly programmed.
- What is deep learning?
Deep learning is the process by which a machine learns using an artificial neural network and data.
Deep learning techniques form a category within machine learning. The main difference is that their architectures consist of multiple layers, which allows such models to learn feature hierarchies: the layers of a deep learning model gradually learn intermediate representations of the data, up until the desired outcome.
- How does ANN learn complex patterns and relationships in data?
It can learn complex patterns thanks to its many weights. What makes the neural network powerful is its ability to model non-linear patterns through its activation functions; a network can approximate any function (universal approximation theorem).
- What is an induced field in ANN?
The induced field of a node is the weighted sum of its inputs plus the bias: $v = \sum_i w_i x_i + b$.
- What is an activation function?
A function applied to the induced field of a node to introduce non-linearity into the network.
- How is activation function useful in the training of a neural network?
- Mention at least 10 activation functions used in neural networks and their major properties?
- step (binary output 0 or 1; non-differentiable at 0, zero gradient elsewhere)
- sigmoid (squashes to (0, 1); saturates for large |x|, prone to vanishing gradients)
- relu (max(0, x); cheap, gives sparse activations, but neurons can "die")
- tanh (squashes to (-1, 1); zero-centered but saturates)
- silu (x * sigmoid(x); smooth and non-monotonic)
- leaky relu (small negative slope instead of 0, avoids dead neurons)
- elu (smooth exponential negative part, pushes mean activations toward 0)
- gelu (smooth, weighs inputs by the Gaussian CDF; used in transformers)
- softmax (normalizes a vector into a probability distribution)
- softplus (smooth approximation of relu)
- Write the mathematical formula of at least 10 activation functions.
Binary step: $f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0 \end{cases}$
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
ReLU: $\text{ReLU}(x) = \max(0, x)$
Leaky ReLU: $\text{LeakyReLU}(x) = \begin{cases} 0.01x & \text{if } x \le 0 \\ x & \text{if } x > 0 \end{cases}$
Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
SiLU: $\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$
ELU: $\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \le 0 \end{cases}$
GELU: $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard Gaussian CDF
Softplus: $\text{Softplus}(x) = \ln(1 + e^x)$
Softmax: $\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
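Most of these are available directly in PyTorch; a quick sketch:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
torch.sigmoid(x)     # values in (0, 1)
torch.relu(x)        # max(0, x)
torch.tanh(x)        # values in (-1, 1)
F.silu(x)            # x * sigmoid(x)
F.gelu(x)            # x * Phi(x)
F.softmax(x, dim=0)  # normalized to a probability distribution
```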
- Mention at least 3 activation functions that scale the values of their inputs between 0 and 1?
sigmoid, binary step, Gaussian (softmax also outputs values between 0 and 1)
- Why is an activation function so powerful?
Non-linearity: without it, a stack of layers collapses into a single linear transformation; with it, the network can approximate arbitrarily complex functions.
- Mention 5 ways a feedforward neural network is different from a recurrent one?
- Data flows in one direction (the computational graph has no cycles)
- No memory: the output depends only on the current input, not on previous ones
- Expects fixed-size inputs rather than variable-length sequences
- No weight sharing across time steps
- Trained with standard backpropagation rather than backpropagation through time
- Give 3 examples each of feedforward neural network and recurrent neural network?
Feedforward neural networks: perceptron, MLP, CNN
Recurrent neural networks: LSTM, BiLSTM, recurrent MLP
- What is supervised learning?
We have labels associated with each example in the data.
- What is unsupervised learning?
We don't have labels associated with the examples in the data.
- Give 3 examples of tasks performed using supervised and unsupervised learning?
supervised learning: image classification, sentiment analysis, machine translation
unsupervised learning: clustering, dimensionality reduction, anomaly detection (e.g. in time series)
- What is self-supervised learning?
A machine learning paradigm where an algorithm generates labels from the data itself and uses these generated labels in a supervised manner.
- What is weakly-supervised learning?
In weakly-supervised learning, the algorithm uses imprecise labels predicted with the help of external methods that do not guarantee full accuracy, like label prediction functions.
- What is semi-supervised learning?
Only part of the data is labeled and the rest is unlabeled. A semi-supervised algorithm uses what it learns from the labeled data to predict labels for the unlabeled instances.
- When would you choose to train your model in an unsupervised manner?
When labels are not available and it is too expensive to label the data.
- Mention the 3 standard splits of a dataset?
Training set, validation set, test set
- What are the uses of each of the splits of dataset in deep learning experiment?
Training set is used to train the model.
Validation set is used to select the best hyperparameters of the model and to prevent the model from overfitting to the training set data. If the model's performance on the validation set starts to decrease, while the performance on the training set continues to increase, this is a sign that the model is overfitting.
Test set is used to evaluate the model and estimate its performance on unseen data.
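In PyTorch, a dataset can be split with `random_split`; a toy sketch with made-up sizes:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# 100 random examples with 10 features and a binary label (placeholder data)
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
train_set, val_set, test_set = random_split(dataset, [70, 15, 15])  # 70/15/15 split
```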
- Differentiate between linear and non-linear classifier?
Linear classifiers separate instances into classes with a straight line (or a hyperplane). Non-linear classifiers can learn more complex, non-linear decision boundaries.
- Give 2 examples each for linear and non-linear classifiers?
Linear classifier: simple perceptron, logistic regression.
Non-linear classifier: multi-layer neural networks, decision trees.
- What type of data do you require to use a non-linear classifier?
Data that is not linearly separable.
- What is a loss function in deep learning?
A function that is used to evaluate the quality of a neural network's output. It is used to compute the difference between the predicted output and the true output.
- Give 5 examples of loss functions used for classification task?
cross-entropy, binary cross-entropy, categorical cross-entropy, KL divergence, hinge loss
- Give 5 examples of loss functions used for regression task?
mean squared error (MSE), mean absolute error (MAE), Huber loss, log-cosh loss, quantile loss
- How does a loss function solve the maximum likelihood expectation?
A loss function measures the error between the predicted solution and the real solution. Minimizing a loss such as cross-entropy (or MSE, under a Gaussian noise assumption) makes the predicted distribution as close as possible to the real one, which is equivalent to maximizing the likelihood of the observed data under the model.
- How does a loss function minimize the difference between the predicted and actual probabilities?
A loss function computes the difference between the predicted probability and the actual probability for every instance. Thus, by searching for the model parameters that minimize the value of the loss function, the model learns to predict probabilities that are as close to the real ones as possible.
- What is the difference between binary cross-entropy and categorical cross-entropy?
Categorical cross-entropy is used for multi-class classification problems, where the output layer has multiple neurons, each corresponding to a class label. The softmax activation function is applied to the output layer to obtain a probability distribution over the classes.
$$\text{softmax}(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{C} e^{s_j}} $$ $$\text{CategoricalCE} = -\sum_{i=1}^{C} t_i \log(\text{softmax}(s)_i) $$ where $C$ is the number of classes, $t_i$ is the ground truth label for class $i$, and $\text{softmax}(s)_i$ is the predicted probability for class $i$.
Binary cross-entropy is used for binary classification problems, where the output layer has a single neuron that predicts the probability of the positive class. The sigmoid activation function is applied to the output layer to obtain a probability value between 0 and 1.
$$\text{sigmoid}(s_i) = \frac{1}{1 + e^{-s_i}}$$ $$\text{BinaryCE} = - \sum_{i=1}^{C' = 2} t_i \log(\text{sigmoid}(s_i)) = -t_1 \log(\text{sigmoid}(s_1)) - (1 - t_1) \log(1 - \text{sigmoid}(s_1)) $$ where $t_1$ is the ground truth label for the positive class, and $\text{sigmoid}(s_1)$ is the predicted probability for the positive class.
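A minimal PyTorch sketch; both functions take raw logits and apply the sigmoid / softmax internally:

```python
import torch
import torch.nn.functional as F

# Binary: one logit per example, target in {0, 1}.
logits = torch.randn(4)
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])
bce = F.binary_cross_entropy_with_logits(logits, targets)

# Multi-class: C logits per example, target is a class index.
logits = torch.randn(4, 5)            # 4 examples, 5 classes
targets = torch.tensor([2, 0, 4, 1])  # class indices
cce = F.cross_entropy(logits, targets)
```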
- When do you use a binary cross-entropy over a categorical cross-entropy loss function?
When you have a binary classification problem.
- When do you use a categorical cross-entropy over a binary cross-entropy loss function?
When you have a multi-class classification problem.
- What is the main difference between MAE and MSE loss function?
MAE uses absolute error (L1) and MSE uses squared error (L2). MSE penalizes large errors more heavily, making it more sensitive to outliers; MAE is more robust to outliers.
- What is the mathematical formula for MSE?
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
- What is the mathematical formula for Huber loss?
$$\text{HuberLoss} = \frac{1}{n} \sum_{i=1}^{n} L_\delta (\hat{y}_i - y_i), \quad L_\delta(a) = \begin{cases} \frac{1}{2} a^2 & \text{if } |a| \le \delta \\ \delta (|a| - \frac{1}{2}\delta) & \text{otherwise} \end{cases}$$
Basically, Huber loss behaves like MSE when the difference between the prediction and the actual value is below the threshold $\delta$, and like MAE otherwise.
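The three regression losses in PyTorch, as a small sketch (`delta` is the threshold $\delta$):

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])

mse = nn.MSELoss()(pred, target)               # squared error (L2)
mae = nn.L1Loss()(pred, target)                # absolute error (L1)
huber = nn.HuberLoss(delta=1.0)(pred, target)  # quadratic below delta, linear above
```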
- What is margin contrastive loss?
A loss function used in similarity learning models. The goal is to place similar examples closer to each other and dissimilar ones further apart in the feature space: $L = y \cdot d^2 + (1 - y) \cdot \max(0, m - d)^2$, where $d$ is the distance between the two embeddings, $y = 1$ for a similar pair, $y = 0$ for a dissimilar pair, and $m$ is the margin. Any distance between a similar pair is penalized, while a dissimilar pair produces a positive loss only when its distance is smaller than the margin.
- What is Regularization?
Regularization is any technique that fights overfitting, for example:
- adding a penalty term to the loss function (L1, L2 or elastic net regularization);
- stopping the training before the model overfits (early stopping);
- zeroing out random neurons during training so that the model does not rely too much on specific neurons in its predictions (dropout, sketched below).
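Two of these in PyTorch (dropout as a layer, L2 as weight decay in the optimizer); the model is a toy placeholder:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),  # randomly zeroes half of the activations during training
    torch.nn.Linear(64, 10),
)
# L2 regularization is applied through the optimizer's weight_decay term:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```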
- Why do you need regularization in deep learning model training?
To prevent a model from overfitting: the model should not fit the training data too closely, so that it can generalize to unseen data.
- Outline the training loop?
Forward pass
Loss computation
Backward propagation
Parameters update
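The same four steps in a minimal PyTorch sketch; the model, loss and data here are toy placeholders, not a prescribed setup:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 1)  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

for x, y in dataloader:
    optimizer.zero_grad()      # reset gradients from the previous iteration
    y_pred = model(x)          # 1. forward pass
    loss = loss_fn(y_pred, y)  # 2. loss computation
    loss.backward()            # 3. backward propagation
    optimizer.step()           # 4. parameters update
```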
- What is local minimum?
A point where the function value is minimal within a certain neighborhood and the gradient of the function equals zero.
- What is global minimum?
A point where the function value is minimal over its entire domain and the gradient of the function equals zero.
- How can a model get stuck in the local minimum?
It can get stuck during the optimization phase, when the gradient becomes zero at a point that is not the global minimum.
- How can you get your model out of a local minimum?
We can:
- use momentum in gradient descent, i.e. reuse the direction of previous gradients to roll through shallow minima and continue the search (think of momentum as in fast-moving physical objects);
- use advanced optimizers (AdaGrad, RMSProp, Adam) that combine adaptive learning rates with momentum (see the sketch below);
- introduce stochasticity into gradient descent in order to explore random points.
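A minimal sketch of the first two options in PyTorch; the model is a hypothetical placeholder:

```python
import torch

model = torch.nn.Linear(10, 1)  # hypothetical model

# SGD with momentum: previous gradient directions help skip shallow minima.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam combines momentum with per-parameter adaptive learning rates.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```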
- How can you compute a gradient of a function over a computational graph?
Use the backpropagation algorithm (chain rule).
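PyTorch's autograd does exactly this; a tiny sketch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x  # the forward pass builds the computational graph
y.backward()        # backpropagation applies the chain rule over the graph
print(x.grad)       # dy/dx = 2x + 3 = 7
```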
- Give 5 reasons why you need a GPU over a CPU in model training?
GPUs can perform many operations simultaneously (thus training is faster)
GPUs are dedicated to simple floating-point operations, which dominate neural network training, and have fewer transistors dedicated to cache or flow control
GPUs have video RAM (VRAM) which allows for faster memory access
GPUs can be used in distributed computing environments and therefore offer a better scalability for large deep learning projects
Many GPUs have specialized hardware for deep learning, such as tensor cores
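In practice, using the GPU in PyTorch is a matter of moving the model and the batches to the device:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(10, 1).to(device)  # parameters moved to GPU memory (VRAM)
x = torch.randn(32, 10, device=device)     # batch created directly on the device
y = model(x)
```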
- What is a deep learning framework?
A library that provides tools specific to the development, training and evaluation of deep learning architectures.
- Give 4 examples of a standard deep learning framework?
PyTorch
TensorFlow
Keras
Theano
- Give 4 importance of using a deep learning framework?
- What does it mean that a deep learning framework's graph is static?
- What does it mean that a deep learning framework's graph is dynamic?
- Give 3 examples of dynamic deep learning frameworks?
- What is a tensor?
A tensor in deep learning is a data structure like a multidimensional array, but one that can be run on either CPU or GPU.
- Mention 4 possible data types of a tensor?
float (e.g., float32, float64), integer (e.g., int8, int32, int64), boolean (bool), complex (e.g., complex64, complex128)
- What are the properties of a tensor?
data type, rank (number of dimensions), shape (number of elements in each dimension)
- How can you calculate the dimension and shape of a tensor?
In PyTorch: Tensor.dim() for the number of dimensions, Tensor.shape for the shape.
- Mention 7 groups of tensor operations and give 1 example of each?
Arithmetic operations (e.g., addition)
Comparison operations (e.g., less than)
Logical operations (e.g., and)
Reduction operations (e.g., sum)
Transformation operations (e.g., reshape)
Generation operations (e.g., ones_like)
Indexing and slicing operations (e.g., tensor[0])
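One example per group, in PyTorch:

```python
import torch

t = torch.arange(6).reshape(2, 3)  # generation (arange) + transformation (reshape)
t + 1                              # arithmetic: addition
t < 3                              # comparison: less than
(t > 0) & (t < 5)                  # logical: and
t.sum(dim=0)                       # reduction: sum along a dimension
torch.ones_like(t)                 # generation: ones_like
t[0, 1:]                           # indexing and slicing
```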
- How do you slice a one-dimensional tensor?
tensor[start:stop:step]
- What is the use of "axis" in tensor operation?
It specifies along which dimension the operation is applied.
- Differentiate between squeeze and reshape operation?
Squeeze removes all dimensions of size 1, while reshape rearranges the tensor into any specified shape with the same total number of elements.
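For instance:

```python
import torch

t = torch.zeros(1, 3, 1, 2)
t.squeeze().shape      # torch.Size([3, 2]): all size-1 dimensions removed
t.reshape(2, 3).shape  # torch.Size([2, 3]): same 6 elements, new layout
```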
- Name 4 places we can find tensors in deep learning models?
input data, layer activations, weights and biases, gradients
- Name 4 properties of images that qualifies them as tensors?
they have dimensions, they have a shape, they can undergo tensor operations, pixels are represented by numerical values
- Mention 10 different tensor operations that can be performed on images?
- reshaping
- rotation
- flipping
- cropping
- filtering
- normalization (scaling pixel values)
- padding
- transposition (permuting axes, e.g. HWC to CHW)
- slicing (extracting channels or regions)
- concatenation (stacking images into a batch)
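A few of these applied directly to an image tensor (a random image stands in for real data):

```python
import torch

img = torch.rand(3, 64, 64)                  # random RGB image (C, H, W)
flipped = torch.flip(img, dims=[2])          # horizontal flip
rotated = torch.rot90(img, 1, dims=[1, 2])   # 90-degree rotation
cropped = img[:, 8:56, 8:56]                 # crop via slicing
normalized = (img - img.mean()) / img.std()  # normalization
batch = torch.stack([img, flipped])          # concatenation into a batch
```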
- What is data augmentation?
- What is the benefit of data augmentation in model training?
- Give 5 image preprocessing techniques that can form styles of data augmentation?
- What is a dataset?
- Name 5 modalities of data in a dataset?
- What does it mean to feed data in batches?
- What is the super class of a PyTorch dataset class?
- Name 3 compulsory functions to implement in a PyTorch dataset class and their functions?
- What is a dataloader?
- What are the major considerations when building a dataloader?
- What is a Convolutional Neural Network and what is it used for?
A CNN is a type of deep neural network that performs local feature extraction at every layer: meaningful features are learnt from small, localized regions of the input data.
CNN architectures are primarily used for computer vision tasks but are not limited to them; they can be applied to all sorts of data, like text or audio, as long as the input can be split into local features.
For example, a hierarchy to be learnt from an image can be: pixel -> edge -> texton -> motif -> contour -> object.
For text data, it can be: character -> word -> clause -> sentence -> story.
- Mention at least 7 layers that can be found in a CNN?
- What is a convolution?
- Write the mathematics of a convolution operation?
- Why is the size of the output of a standard convolution smaller than the input size?
- How can you keep the size of input and output of a convolution the same?
- How does convolution change the channel of a feature?
- What is a convolution filter?
- How does stride influence the number of parameters in a convolutional layer?
- What is a receptive field?
- Mention 10 types of convolution layers and their major characteristics?
- Which type of convolution downsamples an input feature size?
- Which type of convolution upsamples an input feature size?
- Which type of convolution mainly targets the transformation of the input feature channel size?
- How is deformable convolution different from separable convolution?
- How does MobileNet use fewer parameters than conventional CNNs?
- When do you use a 1D convolution?
- When do you use a 2D convolution?
- When do you use a 3D convolution?
- What is a pooling layer?
- How many parameters does the pooling layer have?
- Name 4 types of pooling operations and their effects on the features?
- What is the mathematics of a fully-connected layer?
- When do you use a fully connected layer?
- How do you determine the input and output size of a dense layer?
- What is a dropout layer?
- What is the behavior of a dropout layer during training and during testing?
- Given an input size, convolutional kernel and strides, what is the formula for computing the output size?
- How do you compute the number of parameters of a dense layer?
- How do you compute the number of parameters of a convolution layer?
- How many parameters does a batch normalization layer have?
- What is the use of the init() in PyTorch model design?
- What is the use of the forward() in PyTorch model design?
- What is the super class of a PyTorch NN model?
- What is a sequential model?
- What is a functional model?
- When do you prefer to use a functional model over a sequential one?
- Where is the order of execution of a functional model determined?
- Where is the order of model architecture of a sequential model defined?
- What layers do you consider when you count the number of layers of a CNN model?
- What is a residual connection?
- Mention an example model using a residual connection?
- Mention 3 ways an identity input feature can be connected to the output in a residual connection?
- Why do large deep learning models need residual skip connection?
- How to compute derivatives of a function?
- How does a deep learning model update its weights?
- What are the basic deep learning operations that you can find in a forward pass?
- What are the basic deep learning operations that you can find in a backward pass?
- Why are the intermediate values of each layer cached in memory during a forward pass?
- What types of features are learnt by early-stage layers of a CNN?
- Differentiate between evaluation metrics and loss function?
- Explain the following with regards to model evaluation: TP, TN, FP, FN?
- How do you compute the average precision?
- What is the formula for precision using TP, FP, FN?
- What is the formula for recall using TP, FP, FN?
- What is AUC?
- What is backpropagation?
- How are gradients computed during model training?
- What is gradient descent optimization?
- List steps for the gradient descent algorithm in proper order?
- What is stochastic gradient descent?
- What is batch gradient descent?
- What is mini-batch gradient descent?
- Why is mini-batch gradient descent preferred over stochastic one?
- Why would you use mini-batch gradient descent over batch gradient descent?
- What is an optimizer?
- What are the additional features added by the optimizers?
- Mention the 3 tasks of an optimizer?
- Give 5 examples of optimizers you know?
- What does it mean to have a batch size of 8?
- Differentiate between an epoch and an iteration step?
- What are the impacts of large and small batch sizes on training convergence?
- What is a learning rate?
- What are the impacts of large and small learning rates on training convergence?
- How do you select a learning rate value?
- Mention 4 ways of performing hyperparameter search?
- What is generalization?
- In 5 steps, summarize the training loop?
- What is the interpretation of a training with oscillating loss?
- What is the interpretation of a training with diverging loss?
- What is the interpretation of a training with stagnating loss?
- What is the interpretation of a training with stable loss?
- What is the interpretation of a training with decreasing loss?
- Define overfitting and underfitting?
- Mention 20 ways of overcoming overfitting in deep learning model training?
- What is batch normalization?
- Mention 4 parameters you learn with batch norm?
- What is early stopping in model training?
- What do you understand by model convergence?
- What is transfer learning?
- What do we train by transfer learning?
- What are the basic 3 steps in transfer learning?
- Define unsupervised pretraining?
- What is finetuning?
- What are the 4 finetuning configurations you were taught?
- How is object detection different from object classification?
- How is object detection different from object localization?
- Mention 4 possible ways of localizing an object?
- Interpret the box coordinate values of a localized object?
- List 10 applications of object detections?
- In 4 ways, differentiate between one-stage and two-stage detectors?
- What are anchor boxes in object detection?
- What is multi-scale in object detection and why is it useful?
- What is non-maximum suppression?
- What metric is used in computing non-maximum suppression?
- Give 2 examples of single-stage detectors?
- Give 2 examples of two-stage detectors?
- Explain the concept of feature pyramid in object detection?
- What is region proposal in object detection?
- What is ROI?
- What is multi-task learning?
- Explain IoU in the context of object detection?
- In an object detection problem, what is the metric that defines the "quality" of an inference?
- How is segmentation different from localization?
- Mention at least 5 types of segmentation?
- What is the difference between semantic and instance segmentation?
- When can semantic segmentation be treated as binary segmentation?
- What is panoptic segmentation?
- Name 3 common techniques of upsampling in segmentation model?
- Mention common layers you can find in the encoder layer of a segmentation model?
- Mention common layers you can find in the decoder layer of a segmentation model?
- What is unpooling?
- What is the function of an encoder in a deep learning model?
- What is the function of a decoder in a deep learning model?
- Mention 5 examples of segmentation models you know?
- Explain IoU in the context of segmentation mask evaluation?
- What type of data requires the use of RNN over FNN?
- Give 3 examples of a sequential problem?
- How many states does a GRU have?
- What is a cell function?
- Mention two states of an LSTM cell?
- Explain the unrolling of an RNN layer?
- What is vanishing gradient?
- How does an LSTM suffer from vanishing gradient?
- How can vanishing gradients in LSTM be mitigated?
- What is exploding gradient?
- What is bidirectional RNN?
- Mention and explain the functions of the 3 gates of an LSTM?
- Name 3 non-RNN temporal models?
- What is an attention mechanism in deep learning?
A mechanism that allows a model to focus on specific parts of the input when producing an output.
- Why is attention mechanism important?
The attention mechanism allows the model to focus on the most relevant parts of the input, which improves its ability to understand and generate sequences. (It can also help reduce unnecessary computation.)
- How does attention mechanism overcome the issues in RNN?
By allowing the model to selectively focus on different parts of the input at each step, the attention mechanism removes the bottleneck of compressing the whole input into a single fixed-size context vector, helping the model capture long-range context and improving its performance.
- What is sequence-to-sequence modeling?
Sequence-to-sequence modeling converts one sequence of data into another. A sequence-to-sequence model typically consists of an encoder that processes the input sequence and a decoder that generates the output sequence.
- What is a basic building block of a transformer?
self-attention layer, layer normalization, feed-forward layer, residual connections
- Write the equation of self-attention?
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimension of the keys.
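A direct translation into PyTorch, as a minimal sketch without masking or multiple heads:

```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V

x = torch.randn(2, 5, 16)      # (batch, seq_len, d_k)
out = self_attention(x, x, x)  # self-attention: Q, K, V from the same input
```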
- What are the advantages of multi-head attention?
Each head can capture a different aspect of the input; this improves the expressiveness of the model and adds robustness in case one head attends to an unimportant part.
- What is a cross attention?
Cross-attention is attention computed across two different input sequences: the model attends to one sequence while processing another. Typically, a query sequence attends to key-value pairs coming from a different input (e.g. the decoder attending to encoder outputs).
- What is the use of alignment score in attention process?
The alignment score measures how relevant each part of the input is to the output currently being produced; it is the raw score that gets normalized into the attention weights.
- Differentiate between local and global attention?
In local attention, attention is placed on a few input parts, while in global attention it is placed on all input parts. In global attention, the whole input contributes to the context vector (a vector of representations for each element of the input).
- Differentiate between temporal and spatial attention?
Temporal attention focuses on the time dimension of the input, while spatial attention focuses on the spatial dimensions, such as the height and width of an image. Temporal attention is often used in machine translation and speech recognition, while spatial attention is often used in image classification and object detection.
- Differentiate between positional and channel attention?
Positional attention focuses on the position of each element in the input sequence (e.g. position of the words in a sentence), while channel attention focuses on the different channels or features of the input (e.g. RGB channels in an image).
- What is dot-product attention?
It is a type of attention that computes the attention scores as the dot products of the queries and keys.
- List and explain the different attention operations?
- dot-product (attention scores are the dot products of queries and keys)
- multi-head (several attention operations run in parallel and their outputs are concatenated)
- scaled dot-product (dot products divided by $\sqrt{d_k}$ to keep the scores in a stable range)
- additive attention (uses a feed-forward neural network to compute the attention scores)
- self-attention (query, key, and value vectors all come from the same input sequence)
- cross-attention (queries come from one sequence, keys and values from another)
- local (attends to only a few parts of the input)
- global (attends to all parts of the input)
- What is the use of SoftMax in the attention process?
The softmax function is used in the attention process to normalize the attention scores (regardless of how they are computed, such as through dot-product or additive mechanisms). This ensures that the attention weights sum up to 1, converting them into a probability distribution. This normalization helps the model to focus on the most relevant parts of the input sequence.
- What is the function of positional encoding in a Transformer?
Positional encoding encodes information about the position of an element in the input sequence (e.g. a token in a sentence) together with information about the element itself. The position of an element in a sequence matters because it helps determine how much attention the element should receive. It is especially useful in NLP tasks, as different orders of words in a text can result in totally different meanings.
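As a sketch, the sinusoidal encoding from the original Transformer paper:

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len).unsqueeze(1).float()  # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()           # even dimension indices
    angles = pos / 10000 ** (i / d_model)             # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                   # sine on even dimensions
    pe[:, 1::2] = torch.cos(angles)                   # cosine on odd dimensions
    return pe                                         # added to the token embeddings
```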
- What is the use of masked attention in a Transformer?
Masked attention is used in the decoder part of a transformer to prevent the model from attending to future tokens that have not been generated yet. This ensures that the prediction for each position in the output sequence depends only on the known outputs up to that position. We set the attention scores of future tokens to negative infinity (which results in a probability of zero after applying the softmax) so that they are ignored during the attention computation.
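A sketch of the causal mask on toy scores:

```python
import torch

scores = torch.randn(5, 5)                              # attention scores for a length-5 sequence
mask = torch.triu(torch.ones(5, 5), diagonal=1).bool()  # True above the diagonal = future positions
scores = scores.masked_fill(mask, float("-inf"))        # future tokens get -inf
weights = torch.softmax(scores, dim=-1)                 # -inf becomes probability 0
```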
- How can an image be treated as a sequence?
Each pixel of an image can be treated as an element of a sequence. Another option is to split the image into patches of a certain dimension; these patches are then treated as the elements of a sequence.
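A minimal sketch of the ViT-style patch split, assuming a 224x224 RGB image and 16x16 patches:

```python
import torch

img = torch.randn(1, 3, 224, 224)                  # (batch, channels, H, W)
patches = img.unfold(2, 16, 16).unfold(3, 16, 16)  # cut H and W into 16x16 tiles
patches = patches.reshape(1, 3, -1, 16, 16)        # (1, 3, 196, 16, 16)
seq = patches.permute(0, 2, 1, 3, 4).flatten(2)    # (1, 196, 768): 196 patch vectors
```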
- List at least 5 popular transformer models and their tasks?
BERT: question answering, NER, sentiment analysis
GPT: text generation, machine translation
T5: translation, question answering
ViT: image classification
DETR: object detection
- What are the advantages of a transformer over an RNN?
Transformers are faster to train because they process a sequence as a whole rather than sequentially like an RNN; they are therefore highly parallelizable and more efficient on GPUs.
Transformers use the attention mechanism, which captures dependencies between different parts of an input sequence more effectively, because each token can attend to any other token in the sequence regardless of distance.
- What is the complementary model needed to train a GAN?
- Explain the concept of min-max in adversarial learning?
- What is a diffusion process?
- What is mode collapse?
- What are the main issues with training a GAN?
- How can you stabilize the training of a generative model?