# Transformers in Deep Learning : 

In the context of deep learning, transformers refer to a type of neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. Transformers revolutionized the field of natural language processing (NLP) and have since been successfully applied to various other domains.

Traditional sequence models, such as recurrent neural networks (RNNs), process sequential data one step at a time, updating a hidden state at each step. While RNNs are effective at capturing sequential dependencies, each step depends on the previous one, which makes them hard to parallelize, and they struggle to capture long-range dependencies.

Transformers, on the other hand, rely on a mechanism called self-attention to process sequential data in a parallelizable manner and capture long-range dependencies more effectively. The main idea behind transformers is the concept of attention, which allows the model to focus on different parts of the input when making predictions.



The architecture of a transformer consists of two main components: the encoder and the decoder. Both the encoder and decoder are composed of multiple layers of self-attention mechanisms and feed-forward neural networks.

The encoder takes an input sequence (e.g., a sentence) and processes it by applying self-attention over the entire input sequence. Self-attention allows each position in the input sequence to attend to other positions, enabling the model to weigh the importance of different words or tokens in the sequence. This attention mechanism generates a set of context-aware representations for each input token.

The decoder generates the output one token at a time. During inference, its input is the sequence of tokens it has generated so far; during training, it instead receives the ground-truth output tokens shifted one position to the right (teacher forcing). It uses a combination of self-attention and encoder-decoder attention mechanisms to predict the next token in the sequence. The encoder-decoder attention allows the decoder to attend to the context representations generated by the encoder, ensuring that the model incorporates information from the entire input sequence.



The self-attention mechanism in transformers is based on queries, keys, and values. Each input token is transformed into three vectors: the query, key, and value. These vectors are obtained through learned linear transformations of the input embeddings. The mechanism scores each query against every key, using a scaled dot product in the original Transformer, and normalizes the scores with a softmax to obtain attention weights. The attention weights are then used to compute a weighted sum of the value vectors, which is the context-aware representation of the token.

The self-attention mechanism in transformers allows the model to capture dependencies between all positions in the input sequence simultaneously, providing a more efficient and effective way of processing sequential data compared to traditional recurrent models.

Transformers have gained immense popularity in NLP tasks such as machine translation, language modeling, text classification, and question-answering. They have also been successfully applied to other domains, including computer vision and speech recognition. Notable transformer-based models include BERT, GPT, and T5, which have achieved state-of-the-art performance on a wide range of benchmarks.

# Encoder And Decoder in Transformers :
    
    
In transformers, the encoder and decoder are the two main components of the architecture that work together to process sequential data and generate predictions. Here's a detailed explanation of the encoder and decoder in transformers:

(1). Encoder:
The encoder component of a transformer takes an input sequence (e.g., a sentence) and processes it to generate a set of context-aware representations for each token in the sequence. The encoder is composed of multiple layers of self-attention mechanisms and feed-forward neural networks.

(a). Self-Attention Layer: Each layer in the encoder has a self-attention layer. The self-attention mechanism allows each position in the input sequence to attend to other positions. It computes attention weights from query-key pairs, which measure how relevant each position in the sequence is to the current one. The attention weights are then used to compute a weighted sum of the value vectors, and this sum is the context-aware representation of each token. This self-attention mechanism enables the model to capture dependencies between all positions in the input sequence simultaneously.

(b). Feed-Forward Neural Network: After the self-attention layer, a feed-forward neural network is applied to each token's context-aware representation independently, with the same weights at every position. In the original Transformer, this network consists of two fully connected layers with a non-linear activation function, such as the Rectified Linear Unit (ReLU), between them. It introduces non-linearity and further transforms the representations.
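
As a rough illustration of the position-wise feed-forward step, the sketch below applies the same small network to every token separately. It assumes the two-layer form (linear, ReLU, linear) from the original Transformer paper; the helper names, the toy weights, and the sizes (model width 2, hidden width 3) are made up for this example, and plain Python lists stand in for tensors.

```python
def relu(vector):
    # Element-wise ReLU: negative entries become 0.
    return [max(0.0, v) for v in vector]


def linear(x, W, b):
    # x: vector of length d_in, W: d_in x d_out (list of rows), b: length d_out.
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def feed_forward(tokens, W1, b1, W2, b2):
    # The same weights are applied to each token independently;
    # no information moves between positions in this step.
    return [linear(relu(linear(t, W1, b1)), W2, b2) for t in tokens]


# Toy weights (illustrative values, not learned).
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [-1.0, 0.5]]  # two token representations
out = feed_forward(tokens, W1, b1, W2, b2)
print(out)  # [[1.0, 3.0], [0.0, 2.0]]
```

Because each token is processed on its own, this step can run for all positions in parallel; mixing information across positions is left entirely to the attention layer.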

The encoder processes the input sequence through multiple layers of self-attention and feed-forward networks, allowing the model to capture complex patterns and dependencies in the data. The final output of the encoder is a set of enriched representations for each token in the input sequence.





(2). Decoder:
The decoder component of a transformer takes the context-aware representations generated by the encoder and generates the output sequence (e.g., a translated sentence). Similar to the encoder, the decoder is composed of multiple layers, each containing self-attention, encoder-decoder attention, and a feed-forward neural network.


(a). Self-Attention Layer: Each layer in the decoder also has a self-attention layer, but with a modification compared to the encoder. The decoder's self-attention operates on the output sequence rather than the input sequence, and it is masked so that each position can attend only to itself and to earlier positions in the output. This masking is crucial: it prevents the model from looking at future tokens during training and matches inference, where only the previously generated tokens are available. Information from the input sequence enters the decoder through the separate encoder-decoder attention layer, not through self-attention.


(b). Encoder-Decoder Attention Layer: After the self-attention layer, the decoder has an encoder-decoder attention layer. This layer allows the decoder to attend to the context representations generated by the encoder. It helps the decoder incorporate information from the entire input sequence and leverage the encoder's understanding of the source sequence when generating the output.


(c). Feed-Forward Neural Network: Similar to the encoder, a feed-forward neural network is applied to each token's context-aware representation in the decoder, transforming the representations and introducing non-linearity.
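
The restriction on the decoder's self-attention is usually implemented with a causal mask. The sketch below is a minimal, assumed version: it takes a square matrix of raw attention scores over the output positions, hides every future position by setting its score to negative infinity, and then applies a row-wise softmax. The function names and the all-zero toy scores are illustrative only.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    # scores[i][j]: raw score of output position i attending to position j.
    # Positions j > i lie in the future and are masked out with -inf,
    # which the softmax turns into a weight of exactly 0.
    n = len(scores)
    masked = [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
              for i in range(n)]
    return [softmax(row) for row in masked]


# Equal raw scores make the effect of the mask easy to see.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

With equal scores, the first position puts all of its weight on itself, the second splits its weight between the first two positions, and only the last position spreads its weight over all three.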

The decoder generates the output sequence token by token, attending to both the input sequence and the previously generated output at each step. The final output of the decoder is the fully generated sequence.

By combining the encoder and decoder components, transformers can effectively process sequential data, capture long-range dependencies, and generate high-quality predictions in various tasks such as machine translation, language modeling, and text generation.

# Difference between Encoder And Decoder :
    
    
The encoder and decoder are two distinct components in deep learning architectures, often used in sequence-to-sequence tasks such as machine translation. Here's a comparison of the encoder and decoder:

(1). Encoder:

(a). Function: The encoder processes an input sequence and transforms it into a sequence of vectors, one per input element, each with the same fixed dimension. It encodes the input information into these representations, capturing important features and patterns of the input sequence.

(b). Input: The encoder receives the input sequence, which can be a sequence of words, audio signals, or any sequential data.

(c). Output: The encoder produces a context-aware representation for each element in the input sequence. Each representation incorporates relevant information from the entire input sequence, capturing its semantics and the dependencies between elements.

(d). Usage: The encoder's output can be used for downstream tasks such as classification, information retrieval, or as input to the decoder in sequence-to-sequence tasks.
    
    
    
    
    
(2). Decoder:

(a). Function: The decoder takes the encoded representation produced by the encoder and generates an output sequence. It utilizes the encoded information to produce a sequence of elements (e.g., words, audio signals) that are coherent and relevant to the task.

(b). Input: The decoder takes the encoded representation from the encoder and, during training, also takes the ground truth output sequence (teacher forcing). During inference, it uses its own generated output as input for the next time step.

(c). Output: The decoder generates an output sequence, one element at a time, by attending to the encoded representation and previously generated elements. The output sequence can be of a different length or format than the input sequence.

(d). Usage: The decoder is commonly used in tasks such as machine translation, text generation, image captioning, and speech synthesis, where the goal is to generate a meaningful output sequence based on the encoded input representation.
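
The difference between the decoder's input during training and during inference can be sketched as follows. Everything here is hypothetical scaffolding: `toy_next_token` stands in for a trained decoder (it simply copies the encoded source one token at a time), and the `<bos>` and `<eos>` markers are assumed start and end symbols.

```python
BOS, EOS = "<bos>", "<eos>"


def toy_next_token(encoded, prefix):
    # Hypothetical stand-in for a trained decoder: it copies the encoded
    # source token by token and then emits the end marker.
    step = len(prefix) - 1  # tokens generated so far (prefix starts with BOS)
    return encoded[step] if step < len(encoded) else EOS


def greedy_decode(encoded, next_token, max_len=10):
    # Inference: the decoder is fed its own previous outputs.
    prefix = [BOS]
    while len(prefix) - 1 < max_len:
        token = next_token(encoded, prefix)
        if token == EOS:
            break
        prefix.append(token)
    return prefix[1:]


def teacher_forcing_pairs(target):
    # Training: the decoder input is the ground truth shifted right by one,
    # and the label at each step is the next ground-truth token.
    decoder_input = [BOS] + target
    labels = target + [EOS]
    return list(zip(decoder_input, labels))


print(greedy_decode(["a", "b", "c"], toy_next_token))
print(teacher_forcing_pairs(["x", "y"]))
```

In the training pairs, every step sees the correct previous token regardless of what the model would have predicted, which is what allows all steps to be trained in parallel.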

# Query, Key, Value in Self-Attention :
        
        
In a self-attention mechanism, which is a fundamental component of transformers, the query, key, and value are three essential components that play a crucial role in computing attention weights. Here's a detailed explanation of each component:

(a). Query:
    
The query is a vector that represents the current position or token for which we want to compute attention weights. In the context of self-attention, each input token has an associated query vector. The query vector is obtained by applying a linear transformation (typically a matrix multiplication) to the input token's embedding. The purpose of the query is to capture the characteristics or properties of the current position or token that we are interested in attending to.

(b). Key:
    
The key is a vector that each token exposes so that queries can be matched against it. Like the query, each input token has an associated key vector, obtained by applying a separate learned linear transformation to the input token's embedding. When computing attention for a given query, the keys of all tokens in the sequence are compared with that query, so the key represents the features of a token that determine how relevant it is to a query.

(c). Value:
The value is a vector that contains information or content associated with each token in the input sequence. Similar to the query and key, each input token has an associated value vector. The value vector is obtained by applying a linear transformation (similar to the query and key) to the input token's embedding. The value vector typically represents the content or meaning of the token.
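
To make the three projections concrete, here is a small sketch in plain Python. The embeddings `X` and the weight matrices are toy values chosen so the results are easy to check by hand; in a real model the weights are learned, and `project` is just a hypothetical helper for a matrix multiplication.

```python
def project(X, W):
    # X: n x d_model (one row per token), W: d_model x d_k.
    # Returns the n x d_k matrix X @ W.
    return [[sum(x[i] * W[i][j] for i in range(len(x)))
             for j in range(len(W[0]))]
            for x in X]


# Three tokens with 2-dimensional embeddings (illustrative values).
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Toy projection matrices; a trained model would learn these.
W_Q = [[1.0, 0.0], [0.0, 1.0]]  # identity: queries equal the embeddings
W_K = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two features
W_V = [[2.0, 0.0], [0.0, 2.0]]  # doubles each feature

Q = project(X, W_Q)
K = project(X, W_K)
V = project(X, W_V)
print(Q)
print(K)
print(V)
```

Each token ends up with its own query, key, and value vector, all derived from the same embedding but through different weights.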

The self-attention mechanism computes attention weights by measuring the similarity between the query and key vectors. The dot product is a commonly used method for calculating this similarity. Taking the dot product between a query and each key gives one scalar score per key; applying a softmax function across these scores turns them into attention weights that are non-negative and sum to 1. These attention weights determine the importance or relevance of each key-value pair to the current query.

Once the attention weights are computed, they are used to weigh the corresponding value vectors. The weighted sum of the value vectors gives the context or representation of the current position or token based on the attention mechanism. This context vector captures the relevant information or features from the other positions in the sequence.
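
The two steps described above, scoring and weighting, can be followed for a single query in a few lines. This is a simplified sketch: the function name `attend` and the toy vectors are assumptions, and the scaling of the scores used in the actual Transformer is left out to keep the focus on the dot product, the softmax, and the weighted sum.

```python
import math


def attend(query, keys, values):
    # 1. One dot-product score per key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # 2. Softmax across the scores gives the attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 3. The context vector is the weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, context


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]    # both keys match the query equally well
values = [[2.0, 0.0], [0.0, 2.0]]
weights, context = attend(query, keys, values)
print(weights)  # [0.5, 0.5]
print(context)  # [1.0, 1.0]
```

Since both keys score the same, the query weights both values equally and the context vector is their average.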

# Mathematical Portion of Self-Attention in Transformers :
        
        
        
Let's consider the self-attention calculation for a single head of the multi-head attention mechanism.

Given an input sequence of n tokens whose embeddings are stacked as the rows of a matrix $X = [x_1, x_2, \dots, x_n]$, we apply linear transformations to obtain the query ($Q$), key ($K$), and value ($V$) matrices:

$$
Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V
$$

Here, $W_Q$, $W_K$, and $W_V$ are learnable weight matrices.

Next, we compute a score for every query-key pair with a dot product, scale it, and normalize with a softmax to obtain the attention weights:

$$
A = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$

Here, $d_k$ is the dimensionality of the key vectors. The softmax is applied row-wise, so the weights in each row of $A$ (one row per query) sum to 1.

The attention weights are then used to form a weighted sum of the value vectors. The matrix product already performs this sum, so row $i$ of the result is the context vector for token $i$:

$$
\mathrm{Context} = AV, \qquad \mathrm{Context}_i = \sum_{j=1}^{n} A_{ij} V_j
$$

where $V_j$ is the $j$-th row of $V$.

This process can be summarized as:

$$
\mathrm{Context} = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

The above equations represent the core mathematical operations of self-attention in the transformer model. In practice, multi-head attention is used: these calculations are performed in parallel for multiple attention heads, each with its own learned projection matrices, and the results are concatenated and passed through a final learned linear projection.
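
The sketch below puts the summary formula and the multi-head idea together in plain Python. It is a simplified illustration under stated assumptions: the inputs are already-projected Q, K, and V matrices, each head works on its own slice of the feature dimension (which corresponds to giving each head its own projection), and the final output projection applied after concatenation is omitted. All function names are made up for this example.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, V))
                    for d in range(len(V[0]))])
    return out


def split_heads(M, num_heads):
    # Slice the columns of M into num_heads equal chunks, one per head.
    width = len(M[0]) // num_heads
    return [[row[h * width:(h + 1) * width] for row in M]
            for h in range(num_heads)]


def multi_head_attention(Q, K, V, num_heads):
    heads = [attention(q, k, v)
             for q, k, v in zip(split_heads(Q, num_heads),
                                split_heads(K, num_heads),
                                split_heads(V, num_heads))]
    # Concatenate the per-token outputs of all heads.
    return [[x for head in heads for x in head[t]] for t in range(len(Q))]


Q = K = V = [[1.0, 0.0, 0.0, 1.0],
             [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(Q, K, V, num_heads=2)
print(out)
```

Each head sees only its own two features here, so the heads can learn to attend to different aspects of the tokens; the output keeps the original width because the head outputs are concatenated.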

Additionally, the self-attention mechanism in transformers divides the attention scores by √(d_k). Without this scaling, the dot products of the query and key vectors tend to grow in magnitude as d_k increases, which pushes the softmax into regions where its gradients are extremely small. Scaling keeps the scores in a moderate range and makes training more stable.
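
The effect of the scaling factor can be seen in a small experiment. The sketch below assumes query and key components drawn independently from a standard normal distribution (an idealization, not a property of any trained model), with d_k = 64 and 8 keys chosen arbitrarily. It compares how concentrated the softmax output is with and without dividing by √(d_k).

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)  # fixed seed so the run is repeatable
d_k = 64
n_keys = 8

query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw_scores = [dot(query, k) for k in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

max_raw = max(softmax(raw_scores))
max_scaled = max(softmax(scaled_scores))
print("largest weight without scaling:", max_raw)
print("largest weight with scaling:   ", max_scaled)
```

Without scaling, the largest weight tends to sit close to 1, meaning the softmax is nearly saturated and passes back very small gradients; dividing by √(d_k) spreads the weights more evenly across the keys.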