### **ANNs**

**Q :** What are ANNs?

* ANN = Artificial Neural Network
* they represent a family of non-linear models
* inspired by biological neural network
* a well-structured, sufficiently large NN can approximate any continuous function on a bounded domain to arbitrary accuracy (the universal approximation property)


**Q :** "ANNs are flexible models". Justify.

Whether the data 

* has static patterns
* contains multi-dimensional inputs or
* is sequential in nature

ANNs can be applied to model it. This flexibility is a major reason they are so widely used among ML models.

**Q :** Describe the basic structure of an artificial neuron.

<img src="Images/artificialneuron.png" alt="image description" width=500 height=200>

1. **Inputs $(x_1, x_2, \ldots, x_n)$**:
   - These are the signals or values fed into the neuron
   - Each input represents a feature of the data that the neuron processes

2. **Weights $(w_1, w_2, \ldots, w_n)$**:
   - Each input $x_i$ is associated with a weight $w_i$
   - weight of a feature determines the importance or contribution of that feature to the neuron
   - These weights are learned during training of the NN

3. **Transfer Function (Σ)**:
   - The transfer function computes the *net input* by summing the weighted inputs and adding a bias $b$ (if present)
   - Mathematically, this is:
     $$\text{net}_j = \sum_{i=1}^{n} w_i \cdot x_i + b$$
   - This **summation process gathers all inputs into a single value**, which will then be passed to the activation function

4. **Activation Function $\phi$**:
   - The activation function 
      * takes the net input &
      * transforms it 
      
      determining the final output $o_j$ of the neuron 

5. **Threshold $\theta_j$**:
   - In some models, a threshold $\theta_j$ is set, which the net input must exceed for the neuron to activate. 
   - If the net input is below the threshold, the output might be set to zero or a baseline value, making it a conditional activation.
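
The five components above can be put together in a few lines. This is a minimal sketch, not a reference implementation: the names `neuron`, `step`, `sigmoid` and `relu`, and the input, weight and bias values, are chosen here purely for illustration.

```python
import math

def neuron(x, w, b, phi):
    """One artificial neuron: o = phi(sum_i w_i * x_i + b)."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b  # transfer function (net input)
    return phi(net)                                 # activation function

def step(net, theta=0.0):
    """Threshold activation: fires only if the net input exceeds theta."""
    return 1.0 if net > theta else 0.0

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def relu(net):
    return max(0.0, net)

# Illustrative values: net = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1

for name, phi in [("step", step), ("sigmoid", sigmoid), ("relu", relu), ("tanh", math.tanh)]:
    print(name, neuron(x, w, b, phi))
```

The same net input gives a different output $o_j$ for each choice of $\phi$, which is why the activation function is treated as a separate component of the neuron.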



**Q :** Define transfer functions in the context of an artificial neuron.

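The transfer function computes the net input of the neuron: a linear combination of the inputs, weighted by $w_1, \ldots, w_n$, plus the bias $b$ (if present).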


**Q :** Give examples of commonly used activation functions.

* sigmoid
* ReLU
* tanh

**Q :** What is the role of activation function in an artificial neuron?

* They introduce non-linearity, enabling the neuron to learn complex patterns
* activation function's output defines whether and to what extent the neuron "fires" or activates, influencing the signal passed to the next layer in the network

**Q :** List the commonly used types of artificial neurons.

| Neuron Name         | Brief Description                                                                                       |
|---------------------|--------------------------------------------------------------------------------------------------------|
| Perceptron          | <ul><li>produces binary outputs</li><li>used for linearly separable problems</li><li> $\phi \ =$ step function</li></ul>               |
| Sigmoid Neuron      | <ul><li>$\phi \ =$ sigmoid</li><li>outputs a continuous value between 0 and 1</li></ul>                   |
| ReLU Neuron         | <ul><li> $\phi \ =$ ReLU</li> <li>outputs zero for negative inputs and the input itself if positive </li></ul>|
| Tanh Neuron         | <ul><li>$\phi \ =$ tanh</li> <li>outputs values between -1 and 1</li><li>helps center data</li></ul>              |
| Softmax Neuron      | Outputs a probability distribution over multiple classes, commonly used in classification tasks.       |
| Spiking Neuron      | Mimics biological neurons with time-based activations, used in spiking neural networks (SNNs).         |
| RBF Neuron          | Uses a radial basis function (Gaussian) as the activation, effective for pattern recognition tasks.    |
| LSTM Neuron         | Used in Long Short-Term Memory networks, designed to handle sequential data and retain long-term memory.|
| GRU Neuron          | Used in Gated Recurrent Units, a simpler alternative to LSTMs for processing sequential data.          |
| Convolutional Neuron| Used in CNNs, applying convolution operations to detect spatial features in images.                    |


**Q :** What problem hinders the learning process when multiple perceptrons are connected to each other?

* simple perceptron algorithm cannot be extended to these
* simple gradient based optimization is not applicable to these

**Q :** Why so ?

* The derivative of the step function is zero everywhere except at the origin, where it is undefined, so it gives gradient-based methods no direction in which to adjust the weights

**Q :** Suggest methods to solve these problems.

* replace the step function with smooth non-linear functions which are differentiable
* use back propagation algorithm

**Q :** Give smooth approximations to the step function
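
Two standard smooth stand-ins for the step function are the logistic sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$ and a rescaled tanh, $\frac{1}{2}(1 + \tanh(x))$. The sketch below compares them with the step function. The steepness parameter `k` is an assumption added for illustration: larger `k` makes the curve closer to a hard step while keeping it differentiable.

```python
import math

def step(x):
    return 1.0 if x > 0 else 0.0

def sigmoid(x, k=1.0):
    """Logistic sigmoid with steepness k; tends to the step function as k grows."""
    return 1.0 / (1.0 + math.exp(-k * x))

def sigmoid_grad(x, k=1.0):
    """Derivative of the sigmoid: non-zero everywhere, unlike the step function."""
    s = sigmoid(x, k)
    return k * s * (1.0 - s)

def tanh_step(x, k=1.0):
    """tanh rescaled from (-1, 1) to (0, 1)."""
    return 0.5 * (1.0 + math.tanh(k * x))

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, step(x), round(sigmoid(x, k=10.0), 4), round(tanh_step(x, k=10.0), 4),
          round(sigmoid_grad(x), 4))
```

Because the derivative is non-zero at every point, gradient-based training receives a usable signal everywhere.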



**Q :** Define a layer in an ANN.

* Layer refers to a group of multiple neurons that work together as a single unit
* However, neurons in a layer are not interconnected with each other
* neurons in different layers are interconnected according to the network architecture
* mathematically, a layer = an intermediate vector in computation

### **Backpropagation**

**Q :** What is meant by module in a neural network?

* it is the input-function-output segment in a neural network
* any neural network is composed of several such modules
* $$x \to \boxed{\ y \ = \ f(W\cdot  x + b)} \to y$$
* $W$ denotes the weights of the layer (a vector for a single neuron, a matrix for a layer of several neurons), $b$ denotes the bias of the layer and $f$ is the non-linear activation function of the layer
* In a neural network, each layer/module typically consists of a linear transformation (using weights and biases) followed by a non-linear activation function.

**Q :** What is the role of activation function in a layer?

* The activation function introduces non-linearity, enabling the network to learn complex mappings beyond just linear relationships.

**Q :** Given an objective function Q(.) of the neural network, what is the error signal of any module?

* it is denoted as $e$
* it is the partial derivative of the objective function w.r.t the output of the module
* $$e = \frac{\partial(Q)}{\partial(y)}$$
* This error signal e represents how much the output y affects the overall loss Q.

**Q :** What is the use of error signal?

* it enables the calculation of gradient of Q w.r.t all learnable parameters of the network
* by means of a simple chain rule, we have :
$$\frac{\partial(Q)}{\partial(W)} = \frac{\partial(Q)}{\partial(y)}\frac{\partial(y)}{\partial(W)} = e\frac{\partial(f(W\cdot x+b))}{\partial(W)} = e f'(W\cdot x + b) x$$
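
The chain-rule expression above can be checked numerically for a single scalar module. This sketch assumes a sigmoid activation and a squared-error objective $Q = \frac{1}{2}(y - \text{target})^2$. Both are illustrative choices, as are all the numeric values.

```python
import math

def f(z):
    """Sigmoid activation (an illustrative choice for f)."""
    return 1.0 / (1.0 + math.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def Q(w, b, x, target):
    """Assumed objective: squared error of the module output."""
    y = f(w * x + b)
    return 0.5 * (y - target) ** 2

w, b, x, target = 0.8, -0.3, 1.5, 1.0

z = w * x + b
y = f(z)
e = y - target                 # error signal dQ/dy for the squared error
grad_w = e * f_prime(z) * x    # dQ/dW = e * f'(W x + b) * x
grad_b = e * f_prime(z)        # dQ/db = e * f'(W x + b)

# Central finite differences as an independent check
h = 1e-6
num_w = (Q(w + h, b, x, target) - Q(w - h, b, x, target)) / (2 * h)
num_b = (Q(w, b + h, x, target) - Q(w, b - h, x, target)) / (2 * h)

print(grad_w, num_w)
print(grad_b, num_b)
```

The analytic and numeric values should agree to many decimal places, which is a common sanity check when implementing gradients by hand.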

**Q :** How are the parameters of a neural network learned ?

* loss function or objective function of the network is designed as a function of the network parameters
* by optimizing this objective function w.r.t the parameters, the network learns the value of the parameters

**Q :** What is AD ?

* AD stands for Automatic Differentiation
* it is a technique that efficiently computes the exact gradient (of the objective function w.r.t. the network parameters), up to floating-point rounding, by systematically applying the chain rule to any NN built from differentiable modules

**Q :** How does AD work?

1. $$\text{Neural Network} = \boxed{\text{Composition of simpler functions, each of which is a network module}}$$
2. AD technique passes some key messages (error signal) along the neural network
3. During this passage, all gradients can be computed locally, within the scope of each of the modules using these messages

**Q :** List out the working modes of the AD technique.

1. Forward accumulation mode
2. Reverse accumulation mode

**Q :** Reverse accumulation mode of AD is implemented via the ___ algorithm.

* Error Backpropagation

**Q :** How are error signals generated for all modules of a NN in the reverse accumulation mode (RAM) of AD?

* We have to propagate the error signal along the neural network in a certain way
* propagation should be from the output end of a module to its input end, so that the result can be used as the error signal for the previous module
* While we travel along a neural network in the reverse direction, we observe that :
$$\boxed{\text{ Input of the current module }} = \boxed{\text{ Output of the previous module}}$$
* the propagation of the error signal is continued till the 1st module of the network is reached.

**Q :** Consider this module of the NN given below.

$$\text{input }x \to \boxed{\ y \ = \ f_W(x)} \to \text{output }y$$

Represent mathematically the error backpropagation for this module if e is its error signal.

**A :**
$$\frac{\partial(Q)}{\partial(x)} = \frac{\partial(Q)}{\partial(y)}\frac{\partial(y)}{\partial(x)} = e\times \frac{\partial(f_W(x))}{\partial(x)} $$

* Here, $e$ is the error signal of the current module using which we get $\frac{\partial(Q)}{\partial(x)}$ viz. the error signal of the previous module
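
To make the hand-over of error signals concrete, here is a small sketch with two scalar modules chained together. Each module is assumed to have the form $y = f(w \cdot x + b)$. The parameter values, the squared-error objective and the helper names are all illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each module is y = f(w * x + b); "df" is the derivative of f.
modules = [
    {"w": 0.7, "b": 0.1, "f": math.tanh, "df": lambda z: 1.0 - math.tanh(z) ** 2},
    {"w": -1.2, "b": 0.4, "f": sigmoid, "df": lambda z: sigmoid(z) * (1.0 - sigmoid(z))},
]

def forward(x):
    """Run all modules, remembering each pre-activation z for the backward pass."""
    zs = []
    for m in modules:
        z = m["w"] * x + m["b"]
        zs.append(z)
        x = m["f"](z)  # output of this module = input of the next one
    return x, zs

def loss(x, target):
    y, _ = forward(x)
    return 0.5 * (y - target) ** 2

def backward(x, target):
    """Return the error signal of every module and dQ/dx at the network input."""
    y, zs = forward(x)
    e = y - target  # error signal of the last module
    signals = []
    for m, z in zip(reversed(modules), reversed(zs)):
        signals.append(e)
        e = e * m["df"](z) * m["w"]  # dQ/dx = e * dy/dx: error signal of the previous module
    signals.reverse()
    return signals, e

x0, target = 0.5, 1.0
signals, dQ_dx = backward(x0, target)

h = 1e-6
numeric = (loss(x0 + h, target) - loss(x0 - h, target)) / (2 * h)
print(signals, dQ_dx, numeric)
```

Each pass through the loop applies the formula above once, so the error signal reaches the first module after as many steps as there are modules.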

**Q :** Let's discuss the more general and realistic case where x and y are vectors. Say $x\in R^m$ & $y\in R^n$.

**A :** Let $x = (x_1,x_2,\dots, x_m)$, $y = (y_1, y_2, \dots, y_n)$ be the input and output vectors to a layer
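
A sketch of how the scalar rule extends to vectors, assuming the module is differentiable: each input component $x_i$ can influence $Q$ through every output component $y_j$, so the chain rule sums over all of them.

$$\frac{\partial(Q)}{\partial(x_i)} = \sum_{j=1}^{n}\frac{\partial(Q)}{\partial(y_j)}\frac{\partial(y_j)}{\partial(x_i)}, \qquad i = 1, \dots, m$$

Writing the error signal as the vector $e = \left(\frac{\partial(Q)}{\partial(y_1)}, \dots, \frac{\partial(Q)}{\partial(y_n)}\right)^T$ and the Jacobian of the module as $J \in R^{n\times m}$ with $J_{ji} = \frac{\partial(y_j)}{\partial(x_i)}$, this becomes

$$\frac{\partial(Q)}{\partial(x)} = J^{T}e$$

For the module $y = f(W\cdot x + b)$ with $W \in R^{n\times m}$ and $f$ applied elementwise (here $x$ and $y$ are taken as column vectors and $\odot$ denotes the elementwise product, both notational assumptions), $J = \text{diag}\left(f'(W\cdot x + b)\right)W$, which gives

$$\frac{\partial(Q)}{\partial(x)} = W^{T}\left(e \odot f'(W\cdot x + b)\right), \qquad \frac{\partial(Q)}{\partial(W)} = \left(e \odot f'(W\cdot x + b)\right)x^{T}$$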

**Q :** Give the pseudocode to understand how error backpropagation helps the model to learn its parameters.

**ERROR BACKPROPAGATION PSEUDOCODE**

0. Initialize network parameters (weights and biases) randomly

**Forward pass**
1. Take a training sample (x, target) & input x to the network
2. For each layer in the network:
    - Calculate the output of each neuron using its weights and biases
    - Store the output for use in backpropagation

**Compute the Error at the Output Layer**

3. Calculate the error at the output layer using the loss function:
  $$\boxed{\text{error} = \text{Loss(output, target)}}$$

**Backward Pass (Backpropagation)**

4. Calculate the error gradient for the output layer:
    - Compute the error signal for each neuron in the output layer
    - $$\boxed{\text{Error signal }(e) = \text{ derivative of Loss with respect to output layer activations}}$$
5. Propagate the error backward through each layer, starting from the output layer and moving towards the input. For each layer:
    - Calculate the gradient of the objective with respect to each weight and bias in the layer
    - $$\boxed{\frac{\partial(Q)}{\partial(W)} = \text{error signal * derivative of activation function * previous layer output}}$$
    - Store the gradient for each weight and bias in this layer
    - Compute the error signal for the previous layer, $\frac{\partial(Q)}{\partial(x)}$, from the current error signal and pass it backward

**Update Parameters**

6. For each layer, for each weight W in that layer and bias b of the layer:
    - Update W using gradient descent:
    $$W = W - \eta  \frac{\partial(Q)}{\partial(W)}$$
                
    $$b = b - \eta \frac{\partial(Q)}{\partial(b)}$$
7. Repeat steps 1-6 for all training samples to complete one training epoch

**Optionally track and print loss to monitor training progress**

8. Calculate and print total loss for the epoch if needed
9. Repeat until convergence or maximum number of epochs reached
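
The pseudocode above can be turned into a small runnable sketch. The network size (2 inputs, 2 sigmoid hidden neurons, 1 sigmoid output), the squared-error loss, the OR dataset, the learning rate and the zero-initialised biases are all assumptions made here to keep the example short. They are not prescribed by the pseudocode.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: returns the hidden outputs h and the network output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def train(data, n_hidden=2, eta=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 0: random weights (biases start at zero in this sketch)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):                               # step 9
        total = 0.0
        for x, target in data:                            # step 7
            h, y = forward(x, W1, b1, W2, b2)             # steps 1-2
            total += 0.5 * (y - target) ** 2              # step 3
            e_out = y - target                            # step 4: dQ/dy
            d_out = e_out * y * (1.0 - y)                 # e * f'(net) for a sigmoid
            e_hid = [d_out * w for w in W2]               # step 5: error signal of hidden layer
            d_hid = [e * hi * (1.0 - hi) for e, hi in zip(e_hid, h)]
            # Step 6: gradient descent updates
            W2 = [w - eta * d_out * hi for w, hi in zip(W2, h)]
            b2 -= eta * d_out
            W1 = [[w - eta * d * xi for w, xi in zip(row, x)]
                  for row, d in zip(W1, d_hid)]
            b1 = [b - eta * d for b, d in zip(b1, d_hid)]
        history.append(total)                             # step 8
    return history, (W1, b1, W2, b2)

# Logical OR as a toy dataset
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
history, params = train(data)
print("first epoch loss:", history[0])
print("last epoch loss: ", history[-1])
```

Note that the hidden-layer error signal is computed with the old `W2`, before the update in step 6, so that all gradients of one sample refer to the same set of parameters.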


### **Gradient Problems**

**Q :** Deep neural networks are very powerful models. Although the theory behind such networks was developed in the mid-20th century, they became popular only after 2010. Why did it take so long for them to become popular?

* training deep nets can be hard
* training is done using the backpropagation algorithm
* this method can lead to problems like :
   - vanishing gradient
   - exploding gradient
* they prevent the smooth & successful training of the model
* until around 2006, there was no widely effective way to train deep ANNs, largely because of the vanishing gradient problem
* until then, deep NNs generally performed poorly in comparison with shallow NNs and other ML models

#### **Breakthrough Research Works**

**Q :** RBMs stands for ____.

* Restricted Boltzmann Machines

**Q :** Which research paper made RBMs practical to train ?

* **Paper :** "Training Products of Experts by Minimizing Contrastive Divergence"
* **Author :** Geoffrey E. Hinton
* **Year :** 2002
* **Brief :** 
   - describes the RBM as a product of experts (the RBM itself dates back to Smolensky's "harmonium", 1986)
   - introduces the contrastive divergence algorithm, a fast and efficient way to train RBMs
   - foundation for many deep learning architectures 
   - catalyzed the resurgence of neural networks
   - led to the development of deep belief networks (DBNs) and other deep learning models. 

#### **Problems**

##### **Vanishing Gradients**

**Q :** Define vanishing gradient problem.

**Q :** List out the causes of VG problem.

**Q :** What are the consequences of the VG problem?

**Q :** Explain the math behind VG problem.

**Q :** What are the solutions that will help to mitigate the VG problem?

##### **Exploding Gradients**

**Q :** Define exploding gradient problem.

**Q :** List out the causes of EG problem.

**Q :** What are the consequences of the EG problem?

**Q :** Explain the math behind EG problem.

**Q :** What are the solutions that will help to mitigate the EG problem?

### **Some Clarifications**

* any single layer is composed of many neurons
* each neuron has different parameters and/or activation function
* each neuron independently generates a single scalar output
* all the scalar outputs produced by all neurons in a layer are organized as the components of a vector
* this vector that comprises scalar outputs from the previous layer's neuron will act as the input vector to each and every neuron in the next layer
* each neuron in the next layer will process this vector independently and generate its own output

### **RNNs**

**Q :** What are RNNs?

* a type of neural network designed to handle sequential data
* processes inputs one step at a time
* each step's output is influenced by previous steps
* allows it to retain information from earlier in the sequence
* useful for tasks like 
    * language processing 
    * time series analysis 
    
However, RNNs struggle with long-term dependencies due to issues like the vanishing gradient problem.

**Q :** List out the drawbacks of traditional ANNs that led to the discovery of RNNs.

1. Traditional neural networks are not able to deal with variable length inputs
2. They also do not consider the sequential order of the inputs

###### **TIME DELAYED FEEDBACK**

**Q :** There are several ways to connect 2 layers of a neural network. What is time-delayed feedback connection?

* it is a simple strategy to introduce memory mechanism into NN
* done by adding some time delayed feedback paths to the NN
* a time-delayed feedback path (TDFP) is used to send the status of a layer **y** back to a previous layer as a part of its next input
* NNs containing such feedback paths are called RNNs

**Q :** What are the components of a time-delayed feedback connection?

1. **Input Signal**: 
   * original data or sequence step that is fed into the network at each time step 
   * e.g., a word in a sentence or a data point in a time series

2. **Recurrent Connection**: 
   * connection through which the output from one time step (or the hidden state) is passed as input to the next time step
   * This creates a "loop" that allows the network to retain information from previous steps

3. **Hidden State**:
   * internal representation (usually a vector) that holds information about previous steps, updated at each time step to reflect new input and past context
   * critical to maintaining memory over a sequence

4. **Output Signal**: 
   * final processed **output from the current time step**, which can be used for predictions or further layers
   * and which also feeds back into the next step as part of the recurrent connection

In advanced RNNs like LSTMs and GRUs, additional **gating mechanisms** (like forget, input, and output gates) are included to control how much information from past steps is retained or discarded in the hidden state, further refining the feedback loop.

**Q :** Give an analogy to better understand a feedback path.

Analogy : a **teacher giving a series of lessons to a student**, where each lesson builds on the last to deepen the student’s understanding of a topic


|Component|Analogy|
|---|----|
|Input signal|today's lesson, the new lesson material or note shared by the teacher|
|Recurrent connection|using the understanding of the past lesson for understanding today's lesson by the student|
|Hidden state|student's notebook or memory that is getting updated after each class|
|Output|student's understanding after today's class viz. derived from the notes given by the teacher|
|gates|review and focus areas|


1. Every day, the teacher introduces a new concept or lesson notes

2. As the teacher presents each lesson, the student uses his/her understanding of the previous lesson to better understand the new one.

3. The student keeps a notebook and has a mental understanding of what’s been learned so far.

4. After each lesson, the student’s current level of understanding is like the output. This comprehension is a result of integrating today’s new lesson (input) with all the prior knowledge (hidden state) and influences their performance in that day’s classwork or questions.

5. Sometimes, the teacher reviews certain critical concepts, reinforces essential ideas, or lets go of less important points. 



<img src="Images/timedelayfeedback.jpeg" alt="image description" width=400 height=400>

**Q :** In the above diagram, 

* $y_t$ = output of the network at the current time step t
* $x$ = current input
* $y_t$ is the result of processing the current input $x$ along with the feedback from the previous output $y_{t-1}$

Essentially, $y_{t-1}$ provides the "memory" of what was produced at the last time step and is used as context for generating the current output $y_t$.

**Q :** In the above diagram, ___ is a delayed version of $y_t$?

* $y_{t-1}$

**Q :** Why is this feedback loop called **time-delayed**?

* term **"delayed version of $y_t$"** means that the output $y_t$, which is produced at the current time step $t$, is not used immediately in the feedback loop
* it is **held back (or "delayed") by one time step** and only fed back into the model during the next time instance $t+1$

1. **At time $t$**, the model produces an output $y_t$ based on the current input $x$ and the feedback from the previous output $y_{t-1}$
2. This output $y_t$ is stored and **"delayed"** for one time step
3. **At time $t+1$**, the output $y_t$ (now considered as $y_{t-1}$ from the model's perspective) is fed back as part of the input to help generate the new output $y_{t+1}$.

The "delay" simply means that **each output is used as feedback in the following step, not immediately in the same step**.
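
The one-step delay can be sketched with a single recurrent unit. The update rule $y_t = \tanh(w_x x_t + w_y y_{t-1} + b)$, the scalar signals and the weight values are assumptions chosen for illustration. The diagram itself does not fix them.

```python
import math

def run_feedback(xs, w_x=1.0, w_y=0.5, b=0.0, y_init=0.0):
    """Apply y_t = tanh(w_x * x_t + w_y * y_{t-1} + b) over a sequence of inputs."""
    y_prev = y_init  # content of the one-step delay unit
    ys = []
    for x in xs:
        y = math.tanh(w_x * x + w_y * y_prev + b)  # uses the delayed output
        ys.append(y)
        y_prev = y  # stored now, consumed only at the next time step
    return ys

# A single pulse at the first step, followed by zeros
pulse = [1.0, 0.0, 0.0, 0.0]
with_memory = run_feedback(pulse)
without_memory = run_feedback(pulse, w_y=0.0)

print([round(y, 4) for y in with_memory])
print([round(y, 4) for y in without_memory])
```

With the feedback weight set to zero, the output drops to zero as soon as the input does. With feedback, the pulse keeps echoing in later outputs, which is the memory effect described above.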

**Q :** Give an analogy to represent this delay mechanism.


Imagine you’re writing a diary every night about what you learned each day:

1. On **Monday**, you write down what you learned that day. That’s your "output" for Monday.
2. On **Tuesday**, you review Monday’s notes (the delayed version of what you wrote on Monday) to help you understand Tuesday's new information better.
3. On **Wednesday**, you look back at Tuesday’s notes, and so on.

Each day's understanding (output) becomes useful **only on the following day**; that’s the "delay". Similarly, in a neural network, $y_t$ is delayed by one time step and used in the next cycle as $y_{t-1}$

###### **TAPPED DELAY LINE**

**Q :** What is tapped delay line connection?

* another strategy to introduce memory mechanism to NN
* no recurrent feedback is used in this strategy
* a TDL is a number of synchronized memory units aligned in a line


**Q :** What is the meaning of a delay line in this context ?

* term "tapped delay line" comes from signal processing and electronics
* it originally referred to a physical line or medium (like a wire or a series of circuits) that could store and delay a signal for a certain amount of time
* "Delay line" refers to a series of storage elements that hold onto a signal temporarily, creating a delay
* it's like a sequence of “buckets” that each hold the signal for one time step, passing it along to the next bucket after each step
* In neural networks, this "line" delays each input over a few time steps, creating a short-term memory of past inputs

**Q :** Tapped means?

* tap = to make use of a source of energy, knowledge, etc. that already exists
* tap is a point along the delay line where you can access or "tap into" the stored signal
* Imagine placing sensors along a water pipeline that capture the flow at various points
* each tap represents a specific point in the past (like 1 step ago, 2 steps ago, etc.)
* in a tapped delay line, multiple points (taps) are set along the delay sequence, allowing the network to access different past states at each tap
* In neural networks, these taps provide the model with access to several past inputs simultaneously, enriching the current decision with a range of historical information

In short, the term “tapped delay line” reflects the idea of having multiple "taps" along a delayed sequence to capture different points in time, creating a memory window of past inputs.

**Q :** Explain the function of a tapped delay line using the diagram.

<img src="Images/tappeddelay.jpeg" alt="image description" width=850 height=250>

* $y_t$ represents the input or the value fed into the layer $Y$ at the current time step $t$
* $z^{-1}$ represents the delay of input by one time step
* $a_0, a_1,\dots$ represent the weights of the delayed versions of previous inputs in the current output
* $\hat{z}_t$ is output of the tapped delay line, which is a combination of the current and past values of y after applying the respective weights

Using this setup, the tapped delay line output $\hat{z}_t$ at time $t$ is a weighted sum of y at the current and previous time steps. Mathematically, this can be expressed as:

$$\hat{z}_t = a_0y_{t}+a_1y_{t-1}+a_2y_{t-2}+\dots+a_{L-1}y_{t-(L-1)}$$

where $L$ is the length of delay line. This allows the tapped delay line to maintain a form of temporal memory over multiple time steps, where the influence of each past value can be controlled by its associated weight.
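
The weighted sum above can be sketched with a fixed-length buffer. For brevity this sketch assumes scalar values of $y$, fixed (already learned) coefficients and an initially empty (zero-filled) memory. The class name `TappedDelayLine` is hypothetical.

```python
from collections import deque

class TappedDelayLine:
    """Keeps the last L values of y and returns their weighted sum."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)  # a_0, a_1, ..., a_{L-1}
        L = len(self.coeffs)
        self.memory = deque([0.0] * L, maxlen=L)  # y_t, y_{t-1}, ..., y_{t-(L-1)}

    def step(self, y_t):
        # The newest value enters on the left; every stored value shifts right
        # by one position and the oldest one falls off the end.
        self.memory.appendleft(y_t)
        return sum(a * y for a, y in zip(self.coeffs, self.memory))

tdl = TappedDelayLine([0.5, 0.3, 0.2])  # L = 3
outputs = [tdl.step(y) for y in [1.0, 2.0, 3.0, 4.0]]
print(outputs)
```

At the fourth step the memory holds 4, 3 and 2, so the first input (1) no longer contributes. That is the fixed memory window of length $L$.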

**Q :** Describe the whole concept for thorough understanding.

1. The memory units in the diagram store the values of layer y at the current and previous time instances, $y_t, y_{t-1},\dots$, up to the length of the delay line
2. At the next time instant $t+1$, all values saved in the memory units are shifted right by 1
3. At any time instance t, all the stored values are linearly combined through some learnable parameters to generate a new layer of outputs denoted as $\hat{z}_t$
4. the learnable parameter $a_i$ can be chosen as :
    * a scalar
    * a vector
    * a matrix
5. $y_t, y_{t-1}, \dots$ are vectors (vector representation of words)

**Q :** An important aspect of tapped delay line is that the generated vector $\hat{z}_t$ will be  sent to the next layer. It is basically a non-recurrent feed forward structure that does not introduce any cycle into the neural network. Justify.

* the **previous inputs** are only stored in the memory, not the **previous outputs**
* inputs always remain at the input end only, so no need to feed anything back
* output is fed into next layer not the previous layer, hence feed forward structure

**Q :** Modify the previous analogy to explain the tapped delay line connection.

|Component|Analogy|
|----|----|
|Input|teacher notes|
|output|students understanding|
|delay line|file of paper notes(oldest paper is removed and newest note paper is added)|

1. Teacher notes from previous days are used for learning today's class
2. The student does not use his/her own understanding or test feedback from previous classes for understanding today's class



**Q :** Suppose the length of delay line is k & the number of tokens to process the entire corpus by the TDL is N. Then after training, how many model parameters will be obtained?

* In a tapped delay line model, each tap position has an associated weight, meaning you have k weights that correspond to each tap position in the delay line
* These weights are applied across all training examples, which means that for each new set of taps encountered during training, the model uses the same weights
* During training, as each set of k taps is processed, the weights are adjusted slightly based on the loss or error signal.
* However, these weights are updated in such a way that they generalize across the entire dataset
* After processing one set of taps, the weights are incrementally modified, but they remain the same weights for the next set of taps
* This process repeats for each of the N training examples, continuously refining this single set of k weights
* At the end of training, the model retains only one set of k weights. These weights represent the learned importance of each tap (delay) across the entire training set

**Q :** Why only a single set of weights?

* The tapped delay line model aims to learn a generalizable pattern in the data rather than memorizing specific weights for each tap set.
* By having a single set of k weights, the model can apply the learned importance of each time delay position consistently across different input sequences.

###### **ATTENTION**

**Q :** Attention connection is a modified version of tapped delay line. Justify.

* the coefficients $\{a_0,a_1, \dots\}$ are all learnable parameters in tapped delay line
* in a TDL, once the parameters are learned, they remain constant as network parameters
* in the attention mechanism (AM), the aim is to dynamically adjust these coefficients to select the most prominent features from all saved info based on :
   1. the current input token to be processed &
   2. present internal status of the network
* Unlike a TDL with a fixed window, attention doesn’t impose a strict cutoff on the number of tokens that can matter. Instead, it keeps all previous tokens available at every instant of time
* Unlike TDLs, where some inputs are discarded after a set number of delays, attention mechanisms retain and can weigh all tokens, which is especially valuable for capturing long-term dependencies 
* Instead of a limited window, attention allows each token to be informed by the entire sequence, so the model can dynamically focus on whichever parts of the sequence are most relevant, regardless of distance
* the model has access to all previous (and even current) inputs at any point in time. Each input token (or word) can "attend" to any other token in the sequence, regardless of how far back it appeared.

In attention mechanism, all previous inputs are available and can be “stored” at any instance of time — a stark contrast to the fixed memory limitation in a TDL.


**Q :** At any instance of time, all previous inputs are available. What about the weights of these tokens for processing the current token?

* while all previous inputs are technically accessible, the model assigns different levels of importance (or weights) to each stored input when processing the current input
* each stored input from the sequence is given a different weight based on its relevance to the current input. These are the attention weights; the raw relevance values before normalization are usually called attention scores
* While technically all tokens are considered, only a subset may receive significant weight and, therefore, contribute meaningfully to the understanding of the current input.
* Inputs that are less relevant to the current input will receive lower weights, and thus, their impact will be minimal or even negligible
* All tokens are "eligible" to be attended to, but those with higher attention weights dominate the processing of the current token.
* self-attention mechanisms like in Transformers use techniques like **softmax normalization** to ensure all tokens contribute but with different weights. This allows the model to focus on the few most relevant tokens while still having access to the full context.

**Q :** Explain the working of a single attention layer using a diagram.

<img src="Images/attention.jpeg" alt="image description" width=900 height=250>



1. **Input Sequence**
* The top row represents the sequence of inputs y at different time steps; $y_t, y_{t-1},\dots$ 
* each value in this sequence represents a stored or past value that is relevant for generating the output at time t
* seq represents a sentence (sequence of words as embeddings)
* each $y_{t-i}$ will be an embedding of a word in the sentence being processed by the layer


2. **Delayed signals**
* the signal $y_t$ is passed through a series of delay units (denoted as $z^{-1}$)  which shift the values back in time, enabling the layer to access historical data at each time step, maintaining multiple past values in memory

3. **Attention weights**
* While the layer is processing any sentence, at any instance of time t, it computes weights $\{a_0(t), a_1(t), \dots\}$ for the current and previous tokens in that sentence based on the current token or context

4. **Output**

* $$\hat{z}_t = a_0(t)y_t + a_1(t)y_{t-1}+\dots$$
* The weighted sum operation combines the historical values using their respective coefficients
* The sum is then passed to the next layer or used for further processing, providing a contextually weighted output at each time step


**Q :** What is the significance of the dynamic nature of the attention weights ?

* dynamic nature means we compute the attention weights at each instance while processing a sentence.
* these weights reflect the relevance or importance of each historical input $y_{t-i}$ at the current time t
* Unlike fixed weights, these coefficients can change for every time step, allowing the model to focus on different parts of the sequence as needed.

**Q :** What happens after a sentence gets processed by the attention layer or TDL? Point out the difference.

**Attention**: Each sentence is processed in isolation, with all previous tokens within the sentence available through dynamically computed attention weights. It can capture long-range dependencies within a sentence.

**Tapped Delay Line**: Each sentence is also processed in isolation, but only with a fixed number of recent tokens available (determined by the taps). It only captures short-range dependencies and uses static weights, which makes it less adaptive and unable to retain long-term context.

|Aspect|Attention layer|TDL|
|---|---|----|
|Handling sequential data|processes each sequence independently|same|
|Memory & dependency range|long|short|
|weights|dynamic & content based|fixed|
|cross sentence info|does not carry info across sentences unless a document level task|same|

**Q :** What is attention function ?

* denoted as "$g$"
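* inputs : query matrix $Q$, key matrix $K$ and value matrix $V$, whose rows are the query, key and value vectors
* outputs : $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
* purpose : to compute the attention weights $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ and use them to form a weighted sum of the value vectors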
* time dependency : The attention weights are not fixed; they are recomputed at each time step based on the current query and the full set of keys. This means that the attention weights change depending on where the model is in the sequence and what information is currently most relevant.
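
The attention function can be sketched in plain Python for small inputs. This is a minimal illustration rather than an efficient implementation: matrices are lists of row vectors, and the example values of `Q`, `K` and `V` are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V; also return the attention weights."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # One scaled dot product per key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three stored tokens, d_k = 2
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Q = [[1.0, 0.0], [0.0, 2.0]]              # two different queries

outputs, weights = attention(Q, K, V)
for q, w in zip(Q, weights):
    print(q, [round(x, 3) for x in w])
```

The two queries produce different weights over the same three stored tokens, which illustrates why the coefficients are described as dynamic.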

**Q :** What is query vector? What is its significance?

* it represents the current input token’s perspective
* it captures what the current token seeks from other tokens to enhance its contextual understanding
* It interacts with key vectors (representing other tokens) to compute attention scores, which determine the relevance of each token to the current input. 
* The significance of the query vector lies in its role in dynamically focusing on relevant information, enabling context-aware representations in tasks like language understanding.


**Q :** How is the query vector computed? What is its dimension?

1. From a corpus, its embedding matrix is calculated, say X.
 $$X_{N\times d} \ \begin{cases}N - \text{number of unique words in the corpus}\\ d - \text{dimension of the word embeddings}\end{cases}$$
2. Then a matrix of parameters $W_Q$ comes into picture. It will act as a linear transformation that transforms the word embedding into a query vector with $d_k$ components
$${W_Q}_{d\times d_k}\begin{cases}d - \text{dimension of the word embeddings}\\ d_k - \text{dimension of the query vectors}\end{cases}$$
Using $W_Q$, the query vector corresponding to the current token (an embedding vector) can be calculated
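
As a sketch of this projection, the query vector of a token with embedding $x$ (a row of $X$) is $q = xW_Q$, a vector with $d_k$ components. The helper name `project` and all numeric values below are illustrative assumptions. In practice $W_Q$ is learned during training.

```python
def project(x, W):
    """Multiply a row vector x (length d) by a matrix W (d x d_k)."""
    d, d_k = len(W), len(W[0])
    assert len(x) == d, "embedding dimension must match the rows of W"
    return [sum(x[i] * W[i][j] for i in range(d)) for j in range(d_k)]

# Toy embedding matrix X: N = 2 words, d = 3
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Toy projection matrix W_Q: d = 3, d_k = 2
W_Q = [[0.5, 0.0],
       [0.0, 0.5],
       [1.0, -1.0]]

q0 = project(X[0], W_Q)               # query vector of the first word
Q = [project(x, W_Q) for x in X]      # all query vectors, shape N x d_k
print(q0)
print(Q)
```

Each embedding of dimension $d$ is mapped to a query of dimension $d_k$, so stacking the queries of all $N$ words gives an $N \times d_k$ matrix.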

**Q :** What is key vector? What is its significance?

* it represents a feature or characteristic of an input element
* it serves as a point of reference for determining relevance during the attention calculation
* It helps to identify which values (or value vectors) should be attended to, based on their similarity to the query vector. 
* Essentially, the key vector signifies the importance of an input element or token in relation to a specific query in the sequence processing

**Q :** How is the key vector computed? What is its dimension?

1. From a corpus, its embedding matrix is calculated, say X.
 $$X_{N\times d} \ \begin{cases}N - \text{number of unique words in the corpus}\\ d - \text{dimension of the word embeddings}\end{cases}$$
2. Then a matrix of parameters $W_K$ comes into picture. It will act as a linear transformation that transforms any word embedding into a key vector with $d_k$ components
$$dim(W_K) = dim(W_Q) = d\times d_k$$
Using $W_K$, the key vector corresponding to the current token (an embedding vector) can be calculated

**Q :** What is value vector? What is its significance?

* it contains the actual information or data associated with an input element in an attention mechanism
* It is used to produce the output of the attention process by weighing the importance of each value based on the similarity between the query vector and the key vectors
* The significance of the value vector lies in its role in delivering the relevant information that contributes to the final output, depending on how much attention is given to each input
* value vector of a word in the context of attention mechanisms can be considered similar to its embedding
* However, in the attention mechanism, the value vector specifically **serves to convey the information that is weighted according to the attention scores**, while embeddings can be used in various contexts beyond attention, such as in input representations for neural networks
* Both represent the word in a continuous vector space, capturing semantic information and relationships with other words.

**Q :** How is the value vector computed? What is its dimension?

1. From a corpus, its embedding matrix is calculated, say X.
 $$X_{N\times d} \ \begin{cases}N - \text{number of unique words in the corpus}\\ d - \text{dimension of the word embeddings}\end{cases}$$
2. Then a matrix of parameters $W_V$ comes into picture. It will act as a linear transformation that transforms the word embedding into a value vector with $d_v$ components
$${W_V}_{d\times d_v}\begin{cases}d - \text{dimension of the word embeddings}\\ d_v - \text{dimension of the value vectors}\end{cases}$$
Using $W_V$, the value vector corresponding to any token (an embedding vector) can be calculated
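
The three projections can be sketched together as follows. The sizes, the random matrices and the `token_ids` of the example sentence are hypothetical; in a real model $X$, $W_Q$, $W_K$ and $W_V$ would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_k, d_v = 10, 8, 4, 6          # illustrative sizes only

X = rng.normal(size=(N, d))           # embedding matrix, one row per word
W_Q = rng.normal(size=(d, d_k))       # random stand-ins for learned parameters
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

token_ids = [2, 7, 5]                 # hypothetical sentence of 3 tokens
E = X[token_ids]                      # (3, d) embeddings of those tokens

Q = E @ W_Q                           # (3, d_k): one query per token
K = E @ W_K                           # (3, d_k): one key per token
V = E @ W_V                           # (3, d_v): one value per token
```

Queries and keys must share the dimension $d_k$ because they are compared through a dot product, while $d_v$ can be chosen independently.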

**Q :** Using a flow chart, illustrate the calculation of attention scores.

1. Query vector for the current token is calculated
2. Key vectors of all previous tokens in the sentence are calculated
3. Value vectors of all previous tokens in the sentence are calculated
4. for each key vector :
    - take dot product of it with the current query vector to get its attention score
    - scale the score by dividing it by $\sqrt{d_k}$, where $d_k$ is the dimension of the query & key vectors
5. all the calculated scores are fed into a softmax function. It is a multivariate vector-valued function.
$$\text{all attention scores } \to \boxed{\text{ softmax function }} \to \text{ normalized scores = attention weights}$$
$$(\text{score}_i)_{i=1}^t \to \boxed{\text{ softmax function }} \to \left(\frac{e^{score_i}}{\sum_{j=1}^te^{score_j}}\right)_{i=1}^t$$
Here $t$ denotes the total number of previous tokens at time step $t$.

6. Value vectors of all the previous tokens are linearly combined using their current attention weights to get the output. This linear combination represents the context vector for the current token, capturing relevant information from the previous tokens.

$$\text{Output at time t} = \sum_{i=1}^t\text{weight}_iV_i$$

7. The resulting output vector represents the current token, enriched with information from other tokens in the sequence. If needed, this **output can be considered as the updated version of the current token** (as earlier it was represented using a context-free embedding) & can be passed through a feedforward neural network or an activation function (like ReLU) for further processing
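
The steps above can be sketched for a single current token as follows. The helper name `attend` and the random vectors are assumptions made for illustration; the keys and values stand for the $t$ tokens available at the current time step.

```python
import numpy as np

def attend(query, keys, values):
    # query: (d_k,), keys: (t, d_k), values: (t, d_v)
    d_k = query.shape[0]
    scores = keys @ query / np.sqrt(d_k)   # step 4: one scaled score per key
    exp = np.exp(scores - scores.max())    # step 5: softmax (shifted for stability)
    weights = exp / exp.sum()              # attention weights, sum to 1
    context = weights @ values             # step 6: weighted sum of value vectors
    return context, weights

# Random stand-ins for the vectors of 4 tokens (illustrative sizes).
rng = np.random.default_rng(1)
keys = rng.normal(size=(4, 3))
values = rng.normal(size=(4, 2))
query = rng.normal(size=3)
context, weights = attend(query, keys, values)
```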
    

**Q :** We have seen the calculation of attention weights. Why is it time-dependent?

* at each time step, a new token is being considered for processing by the layer, hence a new query vector is introduced at each time step
* any query vector will interact differently with key vectors, hence attention scores will get changed at each time step
* remember the key vector signifies the importance of an input element or token in relation to a specific query in the sequence processing. The specific query is different at any instance of time
* each time a new word appears, the contextual importance of the other input words changes, and this change is captured by the attention scores
* hence, at every time step the context keeps changing, and so do the attention weights

**Q :** If attention weights keep changing at every time step, then what is the layer trying to learn during training?

- The attention mechanism computes the time-varying coefficients (attention weights) based on **query, key, and value** representations of the input data.
- These representations are obtained by **linear transformations** using weight matrices, $W_Q$, $W_K$ and $W_V$, which are learnable parameters.
- During training, the model learns the values of these matrices that will allow it to compute queries, keys, and values effectively for the task at hand
- The model learns to **adjust the query, key, and value transformations** so that the resulting attention weights highlight the most relevant parts of the input sequence for each output step.
- In many cases, the output of an attention layer is passed on to further layers (such as in a multi-layer transformer). 
- If the attention mechanism is part of a larger model, such as a transformer encoder-decoder, the parameters in these additional layers are also learned during training.


**Q :** Explain the significance of the linear transformations $W_Q$, $W_K$ & $W_V$.

* While the attention coefficients vary depending on the input and context at each time step, the weight matrices $W_Q$, $W_K$, and  $W_V$ that determine how to construct the queries, keys, and values are learned and fixed after training. 

* These learned parameters **define the mechanism for computing attention weights**, enabling the model to **generalize to unseen sequences by dynamically adjusting focus** based on the learned transformations. 

* these transformations define how each input influences others in context giving flexibility in handling sequences of varying length and structure.

**Q :** After training, the transformations $W_Q$, $W_K$ & $W_V$, remain constant model parameters. For any unseen sentence, the word embeddings of words in it also do not change. Now that we have discussed the calculation of Q, K & V, do you think there is a redundancy while calculating K & V for finding out attention scores? What is happening in reality?

Let, "cat sat on the soft mat" be the sentence to be processed by a pre-trained attention layer.

1. For each token in the sentence the model looks up the corresponding word embeddings using the fixed, shared embedding matrix. This embedding lookup happens **only once per token per sentence**.

2. Since the $W_K$ and $W_V$ of the pretrained model are fixed, and each word embedding in this sentence is fixed, the key and value vectors for each token remain constant throughout the processing of this sentence.

3. Processing then proceeds token by token:
    - When processing "cat" (first word), the model generates its query, key, and value vectors.
    - When moving to "sat" (second word), it generates a new query vector for "sat" but **reuses the already-computed key and value vectors** for "cat" since those vectors don’t change within this sentence.
    - For "on" (third word), the process is the same, the model generates a query for "on" and reuses the key and value vectors for both "cat" and "sat"

**NOTE :** Even though the key and value vectors for each word stay fixed for that sentence, the **attention scores and context vectors can still vary with each new word**. This happens because the query vector for each word changes, creating different interactions with the fixed key vectors.
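
This reuse of keys and values can be sketched as a simple cache. Everything here is a hypothetical stand-in: the random `embedding` table, the random weight matrices and the list-based cache are assumptions for illustration, not how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_k, d_v = 6, 4, 4                       # illustrative sizes
W_Q, W_K, W_V = (rng.normal(size=(d, n)) for n in (d_k, d_k, d_v))

sentence = ["cat", "sat", "on", "the", "soft", "mat"]
# Hypothetical embedding lookup table: one fixed random vector per word.
embedding = {word: rng.normal(size=d) for word in sentence}

cached_keys, cached_values, contexts = [], [], []
for word in sentence:
    x = embedding[word]
    # Key and value are computed once per token and only reused afterwards.
    cached_keys.append(x @ W_K)
    cached_values.append(x @ W_V)
    q = x @ W_Q                              # a fresh query at every step
    K, V = np.stack(cached_keys), np.stack(cached_values)
    scores = K @ q / np.sqrt(d_k)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    contexts.append(weights @ V)             # context vector for this word
```

For the first word the cache holds a single key, so its attention weight is 1 and its context vector equals its own value vector.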

**Q :** Extend the student-teacher-lesson analogy discussed previously into the attention layer concept.

* The teacher provides all past lessons’ notes to the student
* for each new lesson, the student selectively chooses which previous notes to review
* The student doesn’t have to focus only on recent lessons; instead, they can scan through all previous notes, applying varying levels of attention to each based on relevance
* The student’s “attention” isn’t limited to recent notes; they can dynamically weight notes from any point in time depending on their relevance to the current lesson 
* This flexibility allows the student to draw connections from any past lesson to make sense of the current one, which enables deeper and broader understanding without the rigid restrictions of a tapped delay line or the sequential dependency of time-delayed feedback

###### **SUMMARIZE**

**Tapped Delay Line** : Limited memory window, simple retention of recent notes only

**Time-Delayed Feedback** : Cyclic, cumulative understanding, sequentially built

**Attention Mechanism** : Non-sequential, selective attention to all past notes based on relevance to the current lesson, enabling flexible and contextual understanding

**Q :** Explain the simplest possible RNN with a diagram.

<img src="Images/RNN_folded.jpeg" alt="image description" width=400 height=400>

* this simple RNN has only 1 hidden layer viz. represented as **a**
* the hidden layer uses hyperbolic tan as the activation function
* there is a time-delayed feedback path viz. shown in red
* this path stores the current value **h** of the hidden layer at each time step and delays it, so that it can be sent back to the input layer and concatenated with the new input arriving at the next time instance


**Q :** What happens if this RNN is used to process the sequence of input vectors $\{x_1,x_2, \dots, x_T\}$, given that $h_0$ represents the initial state of the hidden layer?

* Let $t$ denote the time steps during which the RNN is running. Then $t\in \{1,2,\dots, T\}$
* At any instance of time, the input vector, concatenated with the previous hidden state, reaches the hidden layer after a linear transformation as follows :
$$a_t = W_1[x_t; h_{t-1}] + b_1$$

* this linearly transformed input undergoes a non-linear transformation in the hidden layer as follows :
$$h_t = tanh(a_t)$$

* $h_t$ will be sent to the output layer to generate the output as :
$$y_t = W_2h_t + b_2$$

* $h_t$ will also be stored and delayed in the feedback path and will be concatenated with the next input $x_{t+1}$ during the next time step viz. $t+1$

* $W_1$, $W_2$, $b_1$ & $b_2$ denote the parameters used in the 2 full connections of the RNN
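
The recurrence above can be sketched as a short loop. The function name `rnn_forward`, the layer sizes and the random parameters are assumptions chosen for illustration.

```python
import numpy as np

def rnn_forward(xs, h0, W1, b1, W2, b2):
    h, hs, ys = h0, [], []
    for x_t in xs:
        a_t = W1 @ np.concatenate([x_t, h]) + b1   # a_t = W1 [x_t; h_{t-1}] + b1
        h = np.tanh(a_t)                           # h_t = tanh(a_t)
        ys.append(W2 @ h + b2)                     # y_t = W2 h_t + b2
        hs.append(h)                               # h_t is fed back at step t+1
    return np.array(hs), np.array(ys)

# Random stand-ins for trained parameters (illustrative sizes).
rng = np.random.default_rng(3)
n_in, n_hidden, n_out, T = 3, 5, 2, 4
W1 = rng.normal(size=(n_hidden, n_in + n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = np.zeros(n_out)
xs = rng.normal(size=(T, n_in))
hs, ys = rnn_forward(xs, np.zeros(n_hidden), W1, b1, W2, b2)
```

Note that $W_1$ has $n_{in} + n_{hidden}$ columns because it acts on the concatenation of the input and the previous hidden state.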

**Q :** Explain this diagram below.

<img src="Images/RNN_unfolded.jpeg" alt="image description" width=1000 height=300>

* a convenient way to analyze the behavior of the recurrent feedback in an RNN is to unfold the recursive computation along the time steps for the whole input sequence.
* RNN = duplicated non-recurrent networks for every time instance, each passing a message to its successor

**Q :** What is the major problem with an RNN?

* if an RNN is used to process a long input sequence, the unfolded non-recurrent network becomes very deep
* if T is very large, the length of the above unfolded diagram will be really large
* theoretically, RNNs are powerful models suitable for all types of sequences
* in practice, the learning of RNNs is very difficult due to the deep structure that results from the feedback paths
* Empirical results have shown that simple RNN structures such as the one discussed above are only **good at modeling short-term dependency in sequences** 
* they fail to capture any dependency that spans a long distance in input sequences
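
One way to see why long-distance dependencies are hard to learn is to track how a gradient shrinks as it is propagated back through many time steps. The sketch below uses a deliberately simplified scalar RNN, $h_t = \tanh(w\,h_{t-1} + x_t)$, which is an assumption made for illustration; the weight `w = 0.9` and the constant inputs are arbitrary choices.

```python
import numpy as np

def gradient_through_time(w, xs, h0=0.0):
    # Scalar RNN h_t = tanh(w * h_{t-1} + x_t).
    # Returns |d h_t / d h_0| after each step, built up with the chain rule.
    h, grad, history = h0, 1.0, []
    for x_t in xs:
        h = np.tanh(w * h + x_t)
        grad *= w * (1.0 - h ** 2)     # d h_t / d h_{t-1}
        history.append(abs(grad))
    return history

history = gradient_through_time(w=0.9, xs=np.full(50, 0.5))
```

Every factor in the product is smaller than 1 here, so the influence of $h_0$ on $h_t$ decays geometrically with the distance between them.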

**Q :** List out the structural variations of RNNs that resolve the above issue.

1. Long short-term memory (LSTM)
2. Gated Recurrent Units (GRU)
3. Higher Order Recurrent Neural Networks (HORNN)
4. Bidirectional RNNs


### **LSTM**

#### **Intro**

**Q :** What are the **basic ideas** behind LSTMs?

* address the shortcomings of traditional RNNs 
* incorporate memory cells and gating mechanisms in the RNN
* allow the network to learn and retain information over long sequences
* use a combination of forget, input, and output gates to enable the model to control the flow of information, ensuring that relevant context is preserved while unimportant information is discarded
* address the vanishing gradient (VG) problem

**Q :** What is the **crux** of an LSTM network?

* RNNs have only one feedback path and it captures only short-term memory
* LSTMs use 2 different paths to retain short-term memory and long-term memory separately

**Q :** LSTM is a modification of traditional RNN. List out the **modifications**.

* memory cells
* gates
* cell state in addition to the hidden state
* usage of the sigmoid activation function for the three gates, in addition to tanh for the candidate cell state and for scaling the cell state when producing the hidden state

**Q :** What do you mean by a **neuron** in LSTM network?

* In a standard neural network, a "neuron" refers to a single computational unit
* in LSTMs, a "neuron"/"unit" typically refers to an LSTM cell, each with its own gates (input, forget, and output gates)
* Each LSTM cell is a complex structure that : 
    - performs multiple computations 
    - manages memory over time

**Q :** What do you mean by a **layer** in an LSTM network?

* In a single LSTM layer, there are as many "neurons"/"units" as specified by the number of units or number of hidden dimensions when defining the LSTM layer 
* For example, if we specify an LSTM layer with 128 units, that layer will have 128 LSTM cells at each time step
* Each LSTM layer consists of multiple LSTM cells, all operating in parallel for each time step in a sequence

**Q :** What do you mean by an LSTM **neural network**?

* an LSTM network refers to a neural network with one or more LSTM layers
* You can have a single LSTM layer, or you can stack multiple LSTM layers on top of each other to create a deeper LSTM network
* This layered structure enhances the network's capacity to capture complex, hierarchical patterns in sequential data


**Q :** How does an **LSTM layer process a sequence**?

For a given input sequence (e.g. a sentence, viz. a sequence of words) with T time steps (or elements) :
  - at each time step t:
       - The layer consists of multiple LSTM cells (let's say n cells) that process the same input vector (a word embedding) $x_t$ simultaneously.
       - Each of these cells produces its own hidden state output based on the input $x_t$ and the previous hidden and cell states.
* The output of the LSTM layer at time step t is thus a collection of the hidden state output vectors from all n cells

**Q :** Draw a labelled diagram that represents an LSTM cell's structure.

<img src="Images/LSTM.jpeg" alt="image description" width=700 height=300>

<img src="Images/lstm1.jpg" alt="image description" width=250 height=200>

<img src="Images/lstm.jpg" alt="image description" width=500 height=250>

**Q :** At any instance of time, what are the data being fed into an LSTM neuron?

1. $X_t$ - current element from the input sequence (a vector)
2. $h_{t-1}$ - the hidden state output from previous time step which was delayed and stored for use in this step (a vector), that actually represents the short term memory 
3. $C_{t-1}$ - the cell state output from previous time step which was delayed and stored for use in this step (a vector), that actually represents the long term memory

#### **Forget Gate**

**Q :** What are the components in the forget gate? What is its purpose?

**Components :**

1. 2 weight matrices $W_{f1}$, $W_{f2}$
2. a bias vector $B_f$
3. a sigmoid activation function 
4. current input $X_t$, is given as input
5. previous hidden state output $h_{t-1}$ is also given as input
6. outputs a vector with each component between 0 and 1, denoting the fraction of the corresponding component of the current long-term memory (a vector of the same dimension) to be remembered further

**Diagram :**

$$\begin{cases}1. & \text{Current Input; } X_t \\ 2. & \text{Previous hidden state output; } h_{t-1} \end{cases} \to \boxed{\text{ Forget Gate }} \to \text{Fraction of current LTM to be remembered}$$
**Math :**

$$\text{Forget Gate Activation } = \sigma(W_{f1}h_{t-1} + W_{f2}X_t + B_f) = f_t$$

This output is then multiplied element-wise with the previous LTM vector to scale each of its components :

$$\text{Prev. LTM to be remembered further} = C_{t-1}\odot f_t$$
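
The forget gate computation can be sketched numerically as follows. The layer sizes and the random weights are assumptions for illustration; a trained LSTM would supply learned values for $W_{f1}$, $W_{f2}$ and $B_f$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_in, n_hidden = 3, 4                         # illustrative sizes
W_f1 = rng.normal(size=(n_hidden, n_hidden))  # acts on h_{t-1}
W_f2 = rng.normal(size=(n_hidden, n_in))      # acts on X_t
B_f = np.zeros(n_hidden)

# Random stand-ins for the previous states and the current input.
h_prev = rng.normal(size=n_hidden)
C_prev = rng.normal(size=n_hidden)
X_t = rng.normal(size=n_in)

f_t = sigmoid(W_f1 @ h_prev + W_f2 @ X_t + B_f)   # forget gate activation
retained = C_prev * f_t                           # element-wise product with the LTM
```

Because every entry of `f_t` lies between 0 and 1, each component of `retained` is a shrunken copy of the corresponding component of `C_prev`.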

**Purpose :**
1. Selective memory retention
2. Prevention of memory overload
3. Handle long-term dependencies
4. Enable dynamic memory
5. Prevent gradient problems
6. Enhance model flexibility
7. Efficient resource usage

**Q :** How does the forget gate serve each of its purposes?

1. **Selective Memory Retention**: 
    - By adjusting values in the forget gate output (from 0 to 1), the network can decide which parts of past information are important for future predictions and which can be discarded.

2. **Prevention of Memory Overload**: 
    - retaining too much past information can lead to memory overload
    - irrelevant or outdated information clutters the cell state 
    - allows the LSTM to forget less useful information
    - helps the model focus on more relevant data

3. **Handling Long-Term Dependencies**: 
   - enables the LSTM to learn long-term dependencies in sequential data
   - controls which information persists over time, the network can remember significant information across long gaps
   - useful in tasks like language modeling or time-series forecasting

4. **Enabling Dynamic Memory**: 
    - Different types of data sequences have varying needs for memory retention
    - it adapts dynamically based on the input at each time step, allowing the LSTM to adjust how much of the past information it keeps as context changes

5. **Preventing Gradient Explosions and Vanishing**: 
    - helps mitigate the vanishing gradient problem in RNNs 
    - it ensures that unnecessary information is "forgotten" early on
    - hence, prevents gradients from becoming too small or big
    - helps gradients propagate more effectively through longer sequences, aiding learning stability.

6. **Enhanced Model Flexibility**: 
    - adds flexibility by allowing the LSTM to learn which parts of the memory are useful at any point in the sequence
    - helps the LSTM perform well on complex, real-world tasks where relevant information can vary significantly over time

7. **Efficient Resource Use**: 
    - reduces the computational burden on the cell state
    - particularly beneficial for long sequences, where retaining all past information would be computationally wasteful

**Q :** What essentially is the forget gate trying to control?

* This gate determines what information should be discarded from the cell state or the available LTM. 
* It also uses a sigmoid function to produce values between 0 and 1, which indicate how much of each element in the cell state should be kept or removed
* A value of 0 means "completely forget," while a value of 1 means "keep it all." 

#### **Input Gate**

**Q :** What are the components in the input gate? What is its purpose?

**Components :**

  * UNIT - 01
    - 2 weight matrices $W_{i11}, \ W_{i12}$
    - a bias vector $B_{i1}$
    - a sigmoid activation function
    - $X_t$, the current input vector is inputted
    - $h_{t-1}$, the previous hidden state output is also an input
    - outputs a vector $i_t$, each component of which is a number between 0 & 1 that indicates the fraction of potential LTM from the current input to be remembered further

  * UNIT - 02
    - 2 weight matrices $W_{i21}, \ W_{i22}$
    - a bias vector $B_{i2}$
    - a tanh activation function
    - $X_t$, the current input vector is inputted
    - $h_{t-1}$, the previous hidden state output is also an input
    - outputs a vector $\tilde{C}_t$, each component of which is a number between -1 & 1, representing the potential LTM from the current input

**Math :**

$$\text{Input Gate Activation } = \sigma(W_{i11}h_{t-1} + W_{i12}X_t + B_{i1}) = i_t$$

$$\text{Candidate Cell State } = tanh(W_{i21}h_{t-1} + W_{i22}X_t + B_{i2}) = \tilde{C}_t$$

**Purposes :**

1. controls the flow of new information into the cell state
2. selectively filters input
3. supports long-term dependencies
4. prevents memory overwriting
5. adapts memory updates dynamically
6. enhances temporal pattern recognition
7. balances short-term and long-term information
8. modulates the learning signal during training

**Q :** What do the input gate activation and the candidate cell state imply?

* **input gate activation** is a vector that determines which new information from the candidate cell state $\tilde{C}_t$ will be added to the cell state
* **candidate cell state** represents the "potential" memory that could be added to the cell state

**Q :** What does potential long term memory signify?

* represents the candidate cell state, the "new information" that could be added to the cell's memory
* signifies a potential update to the long-term memory (cell state) based on the current input and previous hidden state, 
* but this update is only incorporated into the actual cell state if the input gate allows it
* generated by applying a **tanh** activation function to a weighted combination of the current input $X_t$ and the previous hidden state $h_{t-1}$. The tanh function scales the candidate values between -1 and 1, making them suitable for updating the cell state.
* represents new information from the current input that the LSTM might want to "remember" for future steps
* If the input gate’s output is close to 1 for a specific piece of information, it suggests that this new information is important and should be added to the cell state
* If the input gate output is closer to 0, that information will be suppressed or only partially added.
* helps the LSTM incorporate relevant details from each time step, allowing it to gradually build up or update its memory with useful information.

**Q :** What essentially is the input gate trying to control?

* This gate controls how much new information from the current input should be added to the cell state or the LTM. 
* It uses a sigmoid activation function to determine which values to update and a tanh activation function to create a vector of new candidate values that could be added. 
* The input gate helps the LSTM decide what information from the input is relevant and should be incorporated into memory.
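
The two units of the input gate, and the way their outputs enter the memory, can be sketched as follows. The cell state update used in the last line is the standard LSTM rule $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$; the sizes, the random weights and the fixed placeholder `f_t` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n_in, n_hidden = 3, 4                                     # illustrative sizes
W_i11, W_i21 = rng.normal(size=(2, n_hidden, n_hidden))  # act on h_{t-1}
W_i12, W_i22 = rng.normal(size=(2, n_hidden, n_in))      # act on X_t
B_i1, B_i2 = np.zeros(n_hidden), np.zeros(n_hidden)

h_prev, C_prev = rng.normal(size=(2, n_hidden))           # random stand-ins
X_t = rng.normal(size=n_in)
f_t = np.full(n_hidden, 0.5)     # placeholder forget gate activation (assumed)

i_t = sigmoid(W_i11 @ h_prev + W_i12 @ X_t + B_i1)       # input gate activation
C_tilde = np.tanh(W_i21 @ h_prev + W_i22 @ X_t + B_i2)   # candidate cell state
C_t = f_t * C_prev + i_t * C_tilde                       # updated long-term memory
```

Components of `i_t` close to 1 let the candidate value through almost unchanged, while components close to 0 suppress it.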

#### **Output Gate**

**Q :** What are the components in the output gate? What is its purpose?

**Components :**

* UNIT - 01:
   - 2 weight matrices $W_{o1}$, $W_{o2}$
   - a bias vector $B_o$
   - a sigmoid activator
   - takes $X_t$, $h_{t-1}$ as inputs
   - outputs $o_t$, the output gate activation, a vector
* UNIT - 02:
   - no weight matrices or bias vector
   - a hyperbolic tan activator
   - takes $o_t$, $C_t$, the new LTM vector as inputs
   - outputs $h_t$, the current hidden state activation, a vector

**Math :**

$$o_t = \sigma(W_{o1}h_{t-1} + W_{o2}X_t + B_o)$$
$$h_t = o_t\odot \tanh(C_t)$$

**Purposes :**

* controls the final output from the cell
* selectively filters the cell state information
* supports sequential predictions
* maintains long-term dependencies
* prevents information overload
* modulates gradient flow
* enhances representational power
* enables smooth transitions between states


**Q :** What do $o_t$ and $h_t$ signify?

* $o_t$ - represents the fraction of the tanh-scaled long-term memory $C_t$ that is exposed as the new short-term memory
* $h_t$ - represents the updated version of the short-term memory

**Q :** What essentially is the output gate trying to control?

* This gate controls what part of the cell state/LTM should be output to the next layer or the next time step. 
* It again uses a sigmoid function to decide which information to pass through and a tanh function to scale the cell state values to be between -1 and 1. 
* The output gate essentially decides how much of the cell state should influence the output at that time step.
* control the flow of information from the long-term memory (cell state) to the short-term memory (output), ensuring that only relevant information is passed along in the sequence processing.
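
The output gate computation can be sketched as follows. The sizes and random weights are assumptions for illustration, and `C_t` is a random stand-in for the already updated cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
n_in, n_hidden = 3, 4                          # illustrative sizes
W_o1 = rng.normal(size=(n_hidden, n_hidden))   # acts on h_{t-1}
W_o2 = rng.normal(size=(n_hidden, n_in))       # acts on X_t
B_o = np.zeros(n_hidden)

h_prev = rng.normal(size=n_hidden)
X_t = rng.normal(size=n_in)
C_t = rng.normal(size=n_hidden)                # stand-in for the updated cell state

o_t = sigmoid(W_o1 @ h_prev + W_o2 @ X_t + B_o)   # output gate activation
h_t = o_t * np.tanh(C_t)                          # new hidden state (short-term memory)
```

The gate can only attenuate what `tanh(C_t)` offers, so each component of `h_t` is no larger in magnitude than the corresponding component of `tanh(C_t)`.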

**Q :** "Each gate in an LSTM neuron can be thought of as a tiny neural network". Justify

* each gate has its own set of parameters (weights and biases) and uses an activation function to transform the inputs. 
* These gates play specific roles in controlling the flow of information in and out of the cell, much like a neural network transforms inputs to achieve a specific function
* During backpropagation, the network adjusts these weights and biases to improve the LSTM’s performance in retaining, forgetting, or outputting the right information


**Q :** What is a memory cell in an LSTM?

* It's not a traditional neuron
* nor is it a traditional standalone layer in the LSTM network
* rather, it is a functional unit in an LSTM layer
* conceptually, it can be thought of as the layer's memory
* it has a tangible implementation in code and in neural architectures of LSTM
* it represents a vector of values that holds the memory state over time
* it is a part of the internal structure of an LSTM layer, coexisting with gate components of it
* designed to store and manage information over long sequences
* allows the network to retain knowledge across multiple time steps
* The memory cell carries information that can be modified by carefully controlled mechanisms, enabling it to selectively remember or forget relevant details as needed
* persistent, adaptable store of information across time steps, managed by gates that regulate what to remember, forget, and output at each step
* makes LSTMs robust in handling long-range dependencies in sequential data
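
To see the memory cell as a vector carried across time steps, the gate equations can be combined into one step function and run over a short sequence. This is a minimal sketch: the name `lstm_step`, the `params` dictionary layout, the sizes and the random weights are all assumptions for illustration, and the cell update is the standard rule $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X_t, h_prev, C_prev, p):
    # p maps each gate name to (W_h, W_x, B), mirroring the gate equations.
    def pre(name):
        W_h, W_x, B = p[name]
        return W_h @ h_prev + W_x @ X_t + B
    f_t = sigmoid(pre("forget"))
    i_t = sigmoid(pre("input"))
    C_tilde = np.tanh(pre("candidate"))
    o_t = sigmoid(pre("output"))
    C_t = f_t * C_prev + i_t * C_tilde     # memory cell update
    h_t = o_t * np.tanh(C_t)               # hidden state read out from the cell
    return h_t, C_t

rng = np.random.default_rng(7)
n_in, n_hidden, T = 3, 4, 6                # illustrative sizes
params = {
    name: (rng.normal(size=(n_hidden, n_hidden)),
           rng.normal(size=(n_hidden, n_in)),
           np.zeros(n_hidden))
    for name in ("forget", "input", "candidate", "output")
}

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
cell_states = []
for X_t in rng.normal(size=(T, n_in)):     # random stand-in input sequence
    h, C = lstm_step(X_t, h, C, params)
    cell_states.append(C)                  # the memory persists from step to step
```

The list `cell_states` shows that the memory cell is simply a vector that is partly kept and partly rewritten at every step, under the control of the gates.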