# Assignment 3

#### 1. Explain the basic architecture of RNN cell.

A Recurrent Neural Network (RNN) cell has three parts: an input layer, a hidden layer, and an output layer. As it processes sequential data, the cell maintains and updates a hidden state that stores context from earlier time steps.

The three parts work as follows:

1. Input Layer: The input layer receives the input sequence at each time step. In the case of text data, each time step corresponds to a word or a character in the sequence. The input is usually represented as a vector or embedding that captures the features of the input at that time step.

2. Hidden Layer: The hidden layer is responsible for capturing and maintaining the sequential information over time. At each time step, the hidden state is updated based on the input at that time step and the previous hidden state. This allows the RNN to capture the context and dependencies between different elements of the sequence. The hidden state is typically represented as a vector of fixed size, and its dimensionality determines the memory capacity of the RNN.

3. Output Layer: The output layer takes the hidden state at each time step and produces the corresponding output or prediction. The output can be a single value, a sequence of values, or a probability distribution over a set of classes, depending on the specific task the RNN is designed for.

The hidden state update combines the input at the current time step with the previous hidden state. The previous hidden state enters through a recurrent weight matrix that connects the hidden layer to itself. This matrix captures the temporal dependencies and allows the RNN to propagate information over multiple time steps.

One important characteristic of the RNN architecture is that it allows for the use of the same set of parameters (weights) across all time steps, enabling the model to process sequences of variable length.
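The update described above is commonly written as `h_t = tanh(W_xh * x_t + W_hh * h_(t-1) + b_h)` with output `y_t = W_hy * h_t + b_y`. The NumPy sketch below assumes that formulation; the weight names, the tanh activation, and the dimensions are illustrative choices rather than anything fixed by the text. It shows one set of parameters being reused at every time step and for sequences of different lengths.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla tanh RNN over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])                      # initial hidden state
    hs, ys = [], []
    for x_t in xs:                                   # same weights at every time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)     # hidden layer update
        y = W_hy @ h + b_y                           # output layer (raw scores)
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2          # arbitrary sizes for the demo
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # recurrent weight matrix
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

# The same parameters process a 5-step and a 9-step sequence.
hs_short, ys_short = rnn_forward(rng.normal(size=(5, input_dim)), W_xh, W_hh, W_hy, b_h, b_y)
hs_long, ys_long = rnn_forward(rng.normal(size=(9, input_dim)), W_xh, W_hh, W_hy, b_h, b_y)
print(hs_short.shape, ys_short.shape)   # (5, 4) (5, 2)
print(hs_long.shape, ys_long.shape)     # (9, 4) (9, 2)
```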

The basic RNN cell can be extended to various variants, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), which address the vanishing gradient problem and improve the ability to capture long-term dependencies in sequential data.

#### 2. Explain Backpropagation through time (BPTT)

Backpropagation through time (BPTT) is the algorithm used to train recurrent neural networks (RNNs). It extends standard backpropagation to sequential, time-dependent data: because an RNN's recurrent connections carry information from one time step to the next, gradients must be propagated backward across many time steps, and BPTT is designed to do exactly that.

BPTT proceeds in four steps:

1. Forward Pass: In the forward pass, the input sequence is fed into the RNN one time step at a time. The RNN processes each time step by updating its hidden state based on the input and the previous hidden state. The output at each time step is computed based on the current hidden state.

2. Loss Calculation: After the forward pass, a loss function is computed to measure the discrepancy between the predicted output and the desired output at each time step. The loss is typically computed using a suitable loss function for the specific task, such as mean squared error (MSE) for regression or cross-entropy loss for classification.

3. Backward Pass: In the backward pass, the gradients of the loss function with respect to the parameters of the RNN are computed. Starting from the final time step, the gradients are propagated backward through time, one time step at a time. At each time step, the gradient with respect to the hidden state combines the gradient of that step's own loss with the gradient flowing back from the subsequent time step.

4. Gradient Update: Once the gradients are computed, the parameters of the RNN are updated using an optimization algorithm, such as stochastic gradient descent (SGD) or Adam. The gradients are used to update the weights and biases of the RNN, allowing the model to learn from the errors and improve its predictions.
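The four steps above can be sketched in NumPy for a small tanh RNN. This is a minimal illustration under stated assumptions, not a reference implementation: biases are omitted, the loss is a per-step squared error, and the names `forward` and `bptt` are hypothetical. A finite-difference check on one recurrent weight is included to confirm that the backward pass agrees with the forward pass.

```python
import numpy as np

def forward(xs, targets, W_xh, W_hh, W_hy):
    """Steps 1-2: forward pass and summed squared-error loss; caches hidden states."""
    hs = [np.zeros(W_hh.shape[0])]                 # hs[0] is the initial state
    loss = 0.0
    for x_t, target in zip(xs, targets):
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1]))
        y = W_hy @ hs[-1]
        loss += 0.5 * np.sum((y - target) ** 2)
    return loss, hs

def bptt(xs, targets, W_xh, W_hh, W_hy):
    """Step 3: backward pass through time, from the last step to the first."""
    _, hs = forward(xs, targets, W_xh, W_hh, W_hy)
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(W_hh.shape[0])              # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        h, h_prev = hs[t + 1], hs[t]
        dy = W_hy @ h - targets[t]                 # dL/dy_t for squared error
        dW_hy += np.outer(dy, h)
        dh = W_hy.T @ dy + dh_next                 # this step's loss + later steps
        dz = (1.0 - h ** 2) * dh                   # back through tanh
        dW_xh += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, h_prev)
        dh_next = W_hh.T @ dz                      # hand gradient to step t-1
    return dW_xh, dW_hh, dW_hy

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 6, 3, 4, 2  # arbitrary sizes for the demo
xs = rng.normal(size=(T, input_dim))
targets = rng.normal(size=(T, output_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))

dW_xh, dW_hh, dW_hy = bptt(xs, targets, W_xh, W_hh, W_hy)

# Finite-difference check on a single recurrent weight.
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (forward(xs, targets, W_xh, W_plus, W_hy)[0]
           - forward(xs, targets, W_xh, W_minus, W_hy)[0]) / (2 * eps)
grad_error = abs(numeric - dW_hh[1, 2])
print(grad_error)                                  # should be very small

# Step 4: one plain SGD update (the learning rate is an arbitrary choice).
lr = 1e-3
loss_before = forward(xs, targets, W_xh, W_hh, W_hy)[0]
W_xh, W_hh, W_hy = W_xh - lr * dW_xh, W_hh - lr * dW_hh, W_hy - lr * dW_hy
loss_after = forward(xs, targets, W_xh, W_hh, W_hy)[0]
print(loss_before, loss_after)
```

Note that `dW_hh` accumulates a contribution from every time step, which is the practical consequence of sharing one weight matrix across the unrolled sequence.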

During training, BPTT unrolls the RNN over the whole length of the sequence and treats each time step as one layer of a deep feedforward network whose layers all share the same weights. Gradients can then be backpropagated through time, so the model learns from the entire sequence. Unrolling also introduces the vanishing and exploding gradient problems, in which gradients shrink or grow exponentially over long sequences.

Common mitigation strategies are gradient clipping, which limits exploding gradients, and specialised RNN variants such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU). These variants use gating mechanisms to regulate information flow and reduce the vanishing gradient issue, making it possible to train RNNs using BPTT more successfully.

#### 3. Explain Vanishing and exploding gradients

Deep neural networks (DNNs) can suffer from vanishing and exploding gradients during training. Recurrent neural networks (RNNs) are especially prone to them because of their recurrent connections and long-term dependencies.

1. Vanishing Gradients: The vanishing gradient problem refers to the phenomenon where the gradients computed during backpropagation become extremely small as they are propagated from the output layer to the earlier layers of the network. During backpropagation the gradient is multiplied, layer by layer, by the weights and activation derivatives; if these factors have magnitude less than 1, the gradient shrinks exponentially with depth.

When the gradients become very small, it becomes challenging for the network to update the weights effectively. The layers closer to the input receive weak gradients, resulting in slow learning or even no learning at all. Consequently, these layers fail to capture complex dependencies or meaningful patterns in the data, limiting the overall performance of the network.

2. Exploding Gradients: Conversely, the exploding gradient problem occurs when the gradients become extremely large during backpropagation. This can happen when the gradients are successively multiplied by factors with magnitude greater than 1, leading to exponential growth. As a result, the weight updates become very large, causing unstable training and making it difficult for the model to converge.

The exploding gradient problem can cause the network to diverge, leading to unstable predictions and failure to learn meaningful representations from the data.

Both vanishing and exploding gradients can hinder the training of deep neural networks, including RNNs. They are most pronounced when gradients must be propagated through many layers or time steps. Traditional RNNs are especially affected because the same recurrent weight matrix is applied at every time step, so its effect on the gradient compounds over long sequences.
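The exponential shrinking and growth can be seen in a small numerical experiment. The sketch below makes simplifying assumptions: the recurrence is linear (`h_t = W_hh * h_(t-1)`, with no activation function), and the recurrent matrix is a scaled orthogonal matrix so that all of its singular values are equal. The function name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(W_hh, steps):
    """Norm of a gradient vector after each backward step through h_t = W_hh @ h_(t-1)."""
    n = W_hh.shape[0]
    grad = np.ones(n) / np.sqrt(n)            # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad                  # one step backward in time
        norms.append(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # orthogonal: every singular value is 1

vanishing = backprop_norms(0.5 * Q, steps=50) # every singular value is 0.5
exploding = backprop_norms(1.5 * Q, steps=50) # every singular value is 1.5
print(vanishing[-1])   # about 0.5**50, effectively zero
print(exploding[-1])   # about 1.5**50, hundreds of millions
```

With a real RNN the tanh derivative (at most 1) multiplies in at every step as well, which pushes further toward vanishing.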

To mitigate these problems, several techniques have been developed:

1. Gradient Clipping: Gradient clipping limits the magnitude of the gradients during backpropagation. When the gradient norm exceeds a chosen threshold, the gradients are rescaled down to that threshold, which prevents them from growing too large. Clipping addresses exploding gradients only; it does not help with gradients that are too small.

2. Weight Initialization: Proper initialization of the weights can also help alleviate the vanishing or exploding gradient problem. Techniques like Xavier or He initialization can be used to initialize the weights in a way that maintains a suitable range of gradients during training.

3. Nonlinear Activation Functions: The choice of activation functions can also impact gradient propagation. Using activation functions that do not saturate (e.g., ReLU, Leaky ReLU) can alleviate the vanishing gradient problem by allowing gradients to flow more effectively.

4. Gradient Regularization: Regularization techniques like L1 or L2 regularization, dropout, or batch normalization can help stabilize gradient updates and improve the generalization of the model.

5. Architectural Modifications: Using specialized RNN variants like Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) can address the vanishing gradient problem by incorporating gating mechanisms that regulate the flow of information and gradients.
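Gradient clipping, the first technique in the list above, is simple enough to sketch directly. The version below clips by the global L2 norm across all parameter gradients, which is one common variant (clipping each value to a fixed range is another). The function name `clip_by_global_norm` and the threshold are illustrative assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads, total_norm              # small gradients are left untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Two gradient arrays with global norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)                # 13.0 and (up to rounding) 1.0
```

Because every array is scaled by the same factor, the direction of the overall gradient is preserved and only its length changes.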

#### 4. Explain Long short-term memory (LSTM)

Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture designed to overcome the vanishing gradient problem and capture long-term dependencies in sequential data. LSTMs are especially effective for making predictions from sequences such as text, speech, time series, or video.

The fundamental idea behind the LSTM is the memory cell, which lets the network retain and access information over long time spans. Three gates (the input gate, the forget gate, and the output gate) control what is written to the memory cell, what is kept, and what is read out, and so regulate the information flow within the LSTM unit.

Here's a breakdown of the components and their functions in an LSTM unit:

1. Memory Cell (C): The memory cell serves as the central element of an LSTM unit. It stores and updates information over time by incorporating three operations: input, forget, and output.

2. Input Gate (i): The input gate determines the extent to which new information should be added to the memory cell. It takes the current input and the previous hidden state (the output of the previous time step) and applies a sigmoid activation function, which decides which values should be updated. A separate layer with a hyperbolic tangent activation produces the candidate update vector (g); the input gate scales this candidate element-wise before it is added to the memory cell.

3. Forget Gate (f): The forget gate decides which information should be discarded from the memory cell. It takes input from the current input data and the previous hidden state and applies a sigmoid activation function. The output of this gate is multiplied element-wise with the previous cell state to selectively forget or retain information.

4. Output Gate (o): The output gate determines the extent to which the information stored in the memory cell should be exposed to the network's output. It takes input from the current input data and the previous hidden state and applies a sigmoid activation function. The output of this gate is multiplied element-wise with the updated memory cell state (after applying the input and forget operations) to produce the current hidden state.

5. Hidden State (h): The hidden state serves as the output of the LSTM unit and contains the summarized information from the input sequence. It is obtained by applying the output gate operation to the updated memory cell state.

#### 5. Explain Gated recurrent unit (GRU)

Like the LSTM, the Gated Recurrent Unit (GRU) is a recurrent neural network (RNN) architecture designed to model long-term dependencies in sequential data and to mitigate the vanishing gradient problem. The GRU simplifies the LSTM by merging the forget and input gates into a single update gate, which yields a smaller and computationally cheaper model.

The following elements make up the GRU unit:

1. Update Gate (z): The update gate determines how much of the previous hidden state should be retained and how much of the new information should be integrated into the current hidden state. It takes input from the current input data and the previous hidden state and applies a sigmoid activation function. The output of this gate controls the flow of information from the previous hidden state to the current hidden state.

2. Reset Gate (r): The reset gate decides how much of the previous hidden state should be ignored while computing the current hidden state. It takes input from the current input data and the previous hidden state and applies a sigmoid activation function. The output of this gate is used to reset the hidden state, allowing the model to focus on the relevant information in the input sequence.

3. Current Hidden State (h): The current hidden state represents the memory or information accumulated from previous time steps. It is a combination of the previous hidden state modified by the update gate and a new candidate hidden state computed using the current input data and the reset gate. The candidate hidden state is computed by applying a hyperbolic tangent activation function to the concatenation of the current input and the element-wise product of the reset gate and the previous hidden state.
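The three elements above can be put together in a single step function. This is a sketch under assumptions: the weight names are hypothetical, each weight matrix acts on the concatenation of the previous hidden state and the current input, and the update gate `z` is taken to weight the previous hidden state (so `z` close to 1 means "retain"). Some references and libraries use the opposite convention, with `1 - z` weighting the previous state.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU step. W_* has shape (hidden, hidden + input); b_* has shape (hidden,)."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(params["W_z"] @ concat + params["b_z"])        # update gate
    r = sigmoid(params["W_r"] @ concat + params["b_r"])        # reset gate
    concat_reset = np.concatenate([r * h_prev, x_t])           # reset applied to old state
    h_cand = np.tanh(params["W_h"] @ concat_reset + params["b_h"])  # candidate state
    return z * h_prev + (1.0 - z) * h_cand                     # blend old state and candidate

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                                   # arbitrary sizes for the demo
params = {}
for name in ("z", "r", "h"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, params)
print(h.shape)   # (4,)
```

Unlike the LSTM, there is no separate cell state: the hidden state is the only thing carried from one step to the next.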

#### 6. Explain Peephole LSTM

The Peephole LSTM extends the Long Short-Term Memory (LSTM) architecture with peephole connections. These give the gates direct access to the cell state, in addition to the current input and the previous hidden state, so the gates have more information when deciding what to store, forget, and output.

In a standard LSTM cell, the forget, input, and output gates control the cell state but are computed only from the current input and the previous hidden state. A peephole LSTM additionally connects the cell state to the gates, so each gate can use the cell's contents when making its decision.

The forget gate, input gate, and output gate formulae are amended to include the peephole connections as follows:

1. Forget Gate: The forget gate determines how much of the previous cell state should be forgotten. In a peephole LSTM, the forget gate takes into account both the previous hidden state and the previous cell state through peephole connections. The equation for the forget gate becomes:

   f_t = sigmoid(W_f * x_t + U_f * h_(t-1) + V_f * c_(t-1) + b_f)

2. Input Gate: The input gate controls how much of the new input should be stored in the cell state. Similar to the forget gate, the input gate in a peephole LSTM incorporates the previous hidden state and the previous cell state. The equation for the input gate becomes:

   i_t = sigmoid(W_i * x_t + U_i * h_(t-1) + V_i * c_(t-1) + b_i)

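3. Cell State: The cell state is updated as in a standard LSTM: the forget gate scales the previous cell state, and the input gate scales the new candidate values. The peephole connections affect this update only indirectly, through the gate activations f_t and i_t. The equation for the updated cell state is: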

   c_t = f_t * c_(t-1) + i_t * tanh(W_c * x_t + U_c * h_(t-1) + b_c)

4. Output Gate: The output gate determines how much of the current cell state should be outputted as the hidden state. In a peephole LSTM, the output gate includes a peephole connection from the current cell state. The equation for the output gate becomes:

   o_t = sigmoid(W_o * x_t + U_o * h_(t-1) + V_o * c_t + b_o)

5. Hidden State: The hidden state is computed by applying the output gate to the cell state. The equation is the same as in a standard LSTM:

   h_t = o_t * tanh(c_t)
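The five equations above translate almost line for line into NumPy. One assumption is made in this sketch: the peephole weights `V_f`, `V_i`, and `V_o` are stored as vectors and applied element-wise, so each gate unit looks only at its own cell (a common choice, though full matrices would also match the notation). The parameter dictionary layout is hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def peephole_lstm_step(x_t, h_prev, c_prev, p):
    """One peephole LSTM step.

    p["W_*"]: (hidden, input), p["U_*"]: (hidden, hidden), p["b_*"]: (hidden,),
    p["V_*"]: (hidden,) peephole weights, applied element-wise (an assumption).
    """
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["V_f"] * c_prev + p["b_f"])
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["V_i"] * c_prev + p["b_i"])
    c_cand = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    c = f * c_prev + i * c_cand
    # The output gate peeks at the *updated* cell state c_t, not c_(t-1).
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["V_o"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                     # arbitrary sizes for the demo
p = {}
for gate in ("f", "i", "c", "o"):
    p[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    p[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    p[f"b_{gate}"] = np.zeros(hidden_dim)
for gate in ("f", "i", "o"):                     # the candidate has no peephole term
    p[f"V_{gate}"] = rng.normal(scale=0.1, size=hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = peephole_lstm_step(x_t, h, c, p)
print(h.shape, c.shape)   # (4,) (4,)
```

Setting all `V_*` vectors to zero recovers the standard LSTM gates.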

#### 7. Bidirectional RNNs

Recurrent neural networks (RNNs) that process input sequences both forward and backward are known as bidirectional RNNs. The hidden state in a typical RNN depends solely on past time steps at each time step. The hidden state at each time step in a bidirectional RNN, however, is affected by both the previous and subsequent time steps.

The architecture of a bidirectional RNN consists of two separate RNNs: one processing the input sequence in the forward direction (from the beginning to the end) and the other processing the input sequence in the backward direction (from the end to the beginning). Each RNN has its own set of parameters and hidden states. The hidden states from both directions are then combined or concatenated to form the final representation.

The forward RNN computes the hidden states in the standard manner, starting from the first input and moving forward through time. The backward RNN, on the other hand, processes the input sequence in reverse order, starting from the last input and moving backward through time. At each time step, the hidden state of the backward RNN depends on both the current input and the hidden state from the next time step.

The main advantage of using bidirectional RNNs is that they capture both past and future context for each time step. This can be particularly useful in tasks where the current output depends on the entire input sequence, such as sequence labeling, sentiment analysis, machine translation, and speech recognition. By incorporating information from both directions, bidirectional RNNs can capture more comprehensive and richer representations of the input sequence, improving the model's ability to understand and predict sequence patterns.

One important limitation is that bidirectional RNNs need the complete input sequence to be available up front, which may not be possible with streaming or real-time data. Because the sequence is processed in both directions, bidirectional RNNs also typically require more memory and computation than unidirectional RNNs. Even so, their improved modelling capabilities make them a useful tool for many sequence-related tasks.
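The two-pass structure can be sketched with a plain tanh RNN in each direction. The helper names (`run_rnn`, `bidirectional_rnn`, `init_params`) and the sizes are assumptions made for illustration. The backward pass simply runs the same kind of RNN on the reversed sequence and then flips its outputs back into time order, so that row `t` of both halves refers to the same input position.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Hidden states of a tanh RNN, in the order the inputs were given."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.array(hs)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each time step."""
    h_fwd = run_rnn(xs, *fwd_params)                  # reads first input to last
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]      # reads last to first, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)

def init_params(rng, input_dim, hidden_dim):
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 6, 3, 4                    # arbitrary sizes for the demo
xs = rng.normal(size=(T, input_dim))
fwd_params = init_params(rng, input_dim, hidden_dim)  # each direction has its own parameters
bwd_params = init_params(rng, input_dim, hidden_dim)

states = bidirectional_rnn(xs, fwd_params, bwd_params)
print(states.shape)   # (6, 8): forward half and backward half per time step
```

Changing only the last input leaves the forward half of every earlier row untouched but alters the backward half, which is the "future context" the section describes.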

#### 8. Explain the gates of LSTM with equations.

LSTM (Long Short-Term Memory) networks use gates to control the flow of information within the network. The gates are responsible for selectively allowing or blocking information from passing through, enabling the network to learn long-term dependencies. LSTM networks typically have three main gates: the input gate, the forget gate, and the output gate. Here's an explanation of each gate along with the corresponding equations:

1. Input Gate (i):
The input gate determines how much new information should be stored in the cell state. It takes the current input and the previous hidden state as inputs and outputs a value between 0 and 1 for each element in the cell state vector. The equation for the input gate is:

   i_t = σ(W_i * [h_(t-1), x_t] + b_i)

   Where:
   - i_t: input gate activation at time step t
   - σ: sigmoid activation function
   - W_i: weight matrix for the input gate
   - h_(t-1): previous hidden state
   - x_t: current input
   - b_i: bias vector for the input gate

2. Forget Gate (f):
The forget gate determines which information from the previous cell state should be discarded. It takes the current input and the previous hidden state as inputs and outputs a value between 0 and 1 for each element in the cell state vector. The equation for the forget gate is:

   f_t = σ(W_f * [h_(t-1), x_t] + b_f)

   Where:
   - f_t: forget gate activation at time step t
   - σ: sigmoid activation function
   - W_f: weight matrix for the forget gate
   - h_(t-1): previous hidden state
   - x_t: current input
   - b_f: bias vector for the forget gate

3. Output Gate (o):
The output gate determines how much information from the cell state should be exposed to the current hidden state and the output. It takes the current input and the previous hidden state as inputs and outputs a value between 0 and 1 for each element in the cell state vector. The equation for the output gate is:

   o_t = σ(W_o * [h_(t-1), x_t] + b_o)

   Where:
   - o_t: output gate activation at time step t
   - σ: sigmoid activation function
   - W_o: weight matrix for the output gate
   - h_(t-1): previous hidden state
   - x_t: current input
   - b_o: bias vector for the output gate

These gate activations control the flow of information through the LSTM cell. The cell state is updated using the following equations:

   C_t = f_t * C_(t-1) + i_t * tanh(W_C * [h_(t-1), x_t] + b_C)

   Where:
   - C_t: updated cell state at time step t
   - f_t: forget gate activation at time step t
   - C_(t-1): previous cell state
   - i_t: input gate activation at time step t
   - tanh: hyperbolic tangent activation function
   - W_C: weight matrix for the cell state
   - h_(t-1): previous hidden state
   - x_t: current input
   - b_C: bias vector for the cell state

The updated cell state is then used to compute the current hidden state:

   h_t = o_t * tanh(C_t)

   Where:
   - h_t: current hidden state
   - o_t: output gate activation at time step t
   - tanh: hyperbolic tangent activation function
   - C_t: updated cell state at time step t
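The gate, cell state, and hidden state equations above map directly onto a single NumPy function. The sketch follows the same symbols (`W_i`, `W_f`, `W_o`, `W_C` acting on the concatenation `[h_(t-1), x_t]`); the dictionary layout, initial values, and sizes are assumptions made for the demo.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step. Each p["W_*"] is (hidden, hidden + input); each p["b_*"] is (hidden,)."""
    concat = np.concatenate([h_prev, x_t])          # [h_(t-1), x_t]
    i = sigmoid(p["W_i"] @ concat + p["b_i"])       # input gate
    f = sigmoid(p["W_f"] @ concat + p["b_f"])       # forget gate
    o = sigmoid(p["W_o"] @ concat + p["b_o"])       # output gate
    C_cand = np.tanh(p["W_C"] @ concat + p["b_C"])  # candidate cell values
    C = f * C_prev + i * C_cand                     # cell state update
    h = o * np.tanh(C)                              # hidden state
    return h, C

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                        # arbitrary sizes for the demo
p = {}
for name in ("i", "f", "o", "C"):
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    p[f"b_{name}"] = np.zeros(hidden_dim)

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, C = lstm_step(x_t, h, C, p)
print(h.shape, C.shape)   # (4,) (4,)
```

When the forget gate is close to 1 and the input gate is close to 0, `C_t` is almost identical to `C_(t-1)`, which is how the cell carries information across many time steps.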

#### 9. Explain BiLSTM

BiLSTM (Bidirectional Long Short-Term Memory) is an extension of the LSTM architecture that uses context from both the past and the future. Unlike a regular LSTM, which only takes the preceding context into account, a BiLSTM processes the input sequence in both the forward and backward directions.

A BiLSTM is made up of two LSTM layers: one processes the input sequence forward (from beginning to end) and the other processes it backward (from end to beginning). Each LSTM layer has its own hidden states and cell states.

Here is a detailed description of how BiLSTM operates:

1. Input Encoding:
The input sequence is encoded into a sequence of word embeddings or other input representations. Each word in the sequence is transformed into a fixed-size vector representation.

2. Forward LSTM:
The forward LSTM layer processes the input sequence in the forward direction. At each time step, the forward LSTM takes the current input and the previous hidden state as inputs and computes the new hidden state and cell state using the LSTM equations. The hidden state captures information from the past context.

3. Backward LSTM:
The backward LSTM layer processes the input sequence in the backward direction. At each time step, the backward LSTM takes the current input and the hidden state it computed for the following time step (the previous step in its own processing order) as inputs and computes the new hidden state and cell state using the LSTM equations. The hidden state captures information from the future context.

4. Concatenation:
After both forward and backward LSTM layers have processed the input sequence, the hidden states from both directions are concatenated at each time step. This combines the information from the past and future contexts into a single representation.

5. Output:
The concatenated hidden states can be further processed by additional layers or directly used for the desired task, such as sequence classification or sequence-to-sequence translation.
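Steps 2 to 5 can be sketched compactly in NumPy. To keep the example short, each direction stores its four weight blocks (input gate, forget gate, output gate, candidate) stacked in a single matrix, which is an implementation convenience assumed here rather than part of the description above. The names `run_lstm`, `bilstm`, and `init` are hypothetical, and the input is taken to be already embedded (step 1).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def run_lstm(xs, W, b):
    """Run an LSTM over xs. W is (4 * hidden, hidden + input) with the input-gate,
    forget-gate, output-gate, and candidate blocks stacked row-wise; b is (4 * hidden,)."""
    hidden = b.shape[0] // 4
    h, c = np.zeros(hidden), np.zeros(hidden)
    hs = []
    for x_t in xs:
        i, f, o, g = np.split(W @ np.concatenate([h, x_t]) + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state
        hs.append(h)
    return np.array(hs)

def bilstm(xs, fwd, bwd):
    """Per-step concatenation of forward and backward LSTM hidden states."""
    h_fwd = run_lstm(xs, *fwd)
    h_bwd = run_lstm(xs[::-1], *bwd)[::-1]             # reversed input, then back to time order
    return np.concatenate([h_fwd, h_bwd], axis=1)

def init(rng, input_dim, hidden_dim):
    return (rng.normal(scale=0.5, size=(4 * hidden_dim, hidden_dim + input_dim)),
            np.zeros(4 * hidden_dim))

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4                     # arbitrary sizes for the demo
xs = rng.normal(size=(T, input_dim))                   # stands in for word embeddings
fwd, bwd = init(rng, input_dim, hidden_dim), init(rng, input_dim, hidden_dim)

states = bilstm(xs, fwd, bwd)
print(states.shape)     # (5, 8): one concatenated vector per time step

# One common sequence-level summary: the final state of each direction.
summary = np.concatenate([states[-1, :hidden_dim], states[0, hidden_dim:]])
print(summary.shape)    # (8,)
```

The per-step `states` suit tasks such as sequence labeling, while a fixed-size vector like `summary` is a typical input to a sequence classifier.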

#### 10. Explain BiGRU

BiGRU (Bidirectional Gated Recurrent Unit) is a variant of the GRU (Gated Recurrent Unit) architecture that processes the input sequence in both the forward and backward directions. Like BiLSTM, BiGRU combines past and future context, which gives a more complete representation of the input sequence.

Here is a detailed description of how BiGRU operates:

1. Input Encoding:
The input sequence is encoded into a sequence of word embeddings or other input representations. Each word in the sequence is transformed into a fixed-size vector representation.

2. Forward GRU:
The forward GRU layer processes the input sequence in the forward direction. At each time step, the forward GRU takes the current input and the previous hidden state as inputs and computes the new hidden state using the GRU equations. The hidden state captures information from the past context.

3. Backward GRU:
The backward GRU layer processes the input sequence in the backward direction. At each time step, the backward GRU takes the current input and the hidden state it computed for the following time step (the previous step in its own processing order) as inputs and computes the new hidden state using the GRU equations. The hidden state captures information from the future context.

4. Concatenation:
After both forward and backward GRU layers have processed the input sequence, the hidden states from both directions are concatenated at each time step. This combines the information from the past and future contexts into a single representation.

5. Output:
The concatenated hidden states can be further processed by additional layers or directly used for the desired task, such as sequence classification or sequence-to-sequence translation.