# <font color='blue' size='5px'>NN Architecture</font>

# Introduction

![image](https://i.pinimg.com/originals/9e/28/ac/9e28ac37571d44ff4075667d94b92875.png)

# AlexNet

# EfficientNet

# Transformers

Transformers and encoder-decoder architectures are higher-level architectural designs built from many layers, rather than individual layers themselves. Their roles are as follows:

1. **Transformers**: Transformers are a neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017). They were originally designed for sequence-to-sequence tasks such as machine translation and are now used widely across natural language processing, including text generation. The core innovation is the self-attention mechanism, which lets the model weigh the importance of every part of the input sequence when computing the representation of each position.

    - **Transformer Encoder**: The encoder maps the input sequence to a sequence of contextual representations, one vector per input position (not a single fixed-length vector). It consists of a stack of identical layers, each containing a multi-head self-attention sublayer and a position-wise feedforward sublayer. Each such layer can be thought of as a "transformer layer." The entire encoder stack is an architectural component.

    - **Transformer Decoder**: The decoder generates the output sequence from the encoder's representation, one position at a time. Each decoder layer contains a masked multi-head self-attention sublayer over the outputs produced so far, a cross-attention sublayer over the encoder's output, and a position-wise feedforward sublayer. The entire decoder stack is also an architectural component.

So, while you can refer to individual components of transformers (e.g., a single self-attention layer) as layers, the transformer itself is more accurately described as an architectural design consisting of encoder and decoder components.
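The self-attention mechanism described above can be sketched as scaled dot-product attention. This is a minimal illustration, not a full implementation: the function name `scaled_dot_product_attention` and the tensor sizes are assumptions, and the learned query/key/value projections and multiple heads used in a real transformer are omitted for brevity.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(q k^T / sqrt(d_k)) v for inputs shaped (..., seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every query position with every key position
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where the boolean mask is False are excluded from attention
        scores = scores.masked_fill(~mask, float("-inf"))
    # Each row of `weights` sums to 1 and says how much a position attends to the others
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # (batch, seq_len, d_model), hypothetical sizes
# Self-attention: queries, keys, and values all come from the same sequence
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)
```

The output keeps one vector per input position, which is why a transformer encoder produces a sequence of representations rather than a single vector.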

2. **Encoder-Decoder Architectures**: Encoder-decoder architectures are a broader class of neural network architectures used for tasks like machine translation, image captioning, and sequence-to-sequence modeling. These architectures consist of two primary components:

    - **Encoder**: The encoder processes the input data and encodes it into an intermediate representation. In classic recurrent sequence-to-sequence models this is a single fixed-length vector; in attention-based models it is a sequence with one vector per input position. The encoder can use various types of layers, such as recurrent layers, convolutional layers, or transformer layers (as discussed earlier). The encoder itself can be thought of as a single architectural component.

    - **Decoder**: The decoder takes the encoder's representation and generates an output sequence, often of variable length. Like the encoder, the decoder can use various layers depending on the task.

The encoder-decoder architecture represents a high-level organization of the model, specifying how information flows from input to output. Within the encoder and decoder components, you may have multiple layers, including recurrent, convolutional, or transformer layers, depending on the architecture chosen for each component.
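As a rough illustration of this information flow, here is a minimal recurrent encoder-decoder sketch. The class name `Seq2Seq`, the choice of GRU layers, and all dimensions are assumptions made for the example; the decoder is fed the target sequence directly (teacher forcing), which is only one possible setup.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Hypothetical GRU-based encoder-decoder, for illustration only."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(output_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, output_dim)

    def forward(self, src, tgt):
        # The encoder's final hidden state summarizes the whole source
        # sequence as a fixed-length vector: (1, batch, hidden_dim)
        _, context = self.encoder(src)
        # The decoder starts from that context and reads the target sequence
        dec_out, _ = self.decoder(tgt, context)
        # Project each decoder state to the output dimension
        return self.proj(dec_out)

model = Seq2Seq(input_dim=8, hidden_dim=32, output_dim=4)
src = torch.randn(3, 10, 8)  # (batch, src_len, input_dim)
tgt = torch.randn(3, 7, 4)   # (batch, tgt_len, output_dim)
out = model(src, tgt)
print(out.shape)
```

Note that the source and target lengths differ (10 and 7 here); the output length follows the target, not the input.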

In summary, transformers and encoder-decoder architectures are higher-level architectural designs that encompass multiple layers and specify how these layers are connected to perform specific tasks, such as sequence-to-sequence modeling or language understanding and generation.

The Transformer model is composed of both the Encoder and Decoder components, which work together to process input sequences and generate output sequences:

**Transformer Model (Encoder-Decoder):**

**Purpose:**
- The Transformer model combines the Encoder and Decoder to map an input sequence to an output sequence. It is widely used in machine translation, text generation, and other sequence-to-sequence tasks.

**Key Components:**
1. **Encoder:** The Encoder is responsible for processing the input sequence. It applies multi-head self-attention and feedforward neural networks to capture dependencies and relationships between elements in the input sequence.

2. **Decoder:** The Decoder is responsible for generating the output sequence based on the encoded input. It applies multi-head self-attention, multi-head cross-attention with the encoded input, and position-wise feedforward networks.

3. **Positional Encodings:** Transformers do not have built-in information about the order of elements in a sequence. Positional encodings are added to the input embeddings to provide information about the positions of elements.

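4. **Masking:** A causal (look-ahead) mask in the Decoder's self-attention prevents each position from attending to future positions. This lets training process all target positions in parallel while still matching step-by-step generation, so the model cannot cheat by looking ahead. Padding masks are also commonly used so that attention ignores padded positions.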

5. **Layer Normalization:** Normalization techniques are applied to stabilize the training process in both the Encoder and Decoder.

6. **Residual Connections:** Residual connections (skip connections) are used to mitigate the vanishing gradient problem and enable the training of very deep networks in both the Encoder and Decoder.
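To make the positional encoding component concrete, the sketch below builds the fixed sinusoidal encoding used in the original Transformer paper and adds it to a batch of embeddings. The function name and the sizes are assumptions for illustration; learned positional embeddings are a common alternative, and the code assumes an even `d_model`.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) table of sinusoidal encodings (d_model must be even)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Frequencies decrease geometrically across the embedding dimensions
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

embeddings = torch.randn(2, 10, 16)  # (batch, seq_len, d_model), hypothetical sizes
pe = sinusoidal_positional_encoding(10, 16)
# The same table is broadcast over the batch dimension
x = embeddings + pe
print(x.shape)
```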

**Mathematics:**
- The Transformer model combines the operations of both the Encoder and Decoder:
  - Input sequences are passed through the Encoder to obtain encoded representations.
  - These encoded representations are then passed through the Decoder to generate the output sequences.
  
**Example Code in PyTorch:**
```python
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, encoder, decoder):
        super(Transformer, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, src_mask, tgt_mask):
        # Pass input sequence through the Encoder
        memory = self.encoder(src, src_mask)
        
        # Pass encoded memory and target sequence through the Decoder
        output = self.decoder(tgt, memory, tgt_mask, src_mask)
        
        return output
```

In the example code above, we define a simplified Transformer model using PyTorch. It combines the operations of both the Encoder and Decoder to perform sequence-to-sequence tasks. The input sequences are first processed by the Encoder, and then the encoded memory is used by the Decoder to generate the output sequences.
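The `tgt_mask` argument is typically a causal mask. A minimal sketch of how such a mask could be built is shown below; the helper name `causal_mask` is an assumption, and the additive float convention (0 for allowed, `-inf` for blocked) is the one accepted by `nn.MultiheadAttention` via `attn_mask`.

```python
import torch

def causal_mask(size):
    """(size, size) additive mask: 0 on and below the diagonal, -inf above it."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

mask = causal_mask(4)
print(mask)

# Adding the mask to attention scores before the softmax zeroes out
# the weights on future positions (uniform scores used here for clarity)
weights = torch.softmax(torch.zeros(4, 4) + mask, dim=-1)
print(weights)
```

Row `i` of the resulting weights is nonzero only for columns `0..i`, so position `i` can attend to itself and earlier positions only.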

The Transformer has become the dominant architecture in natural language processing and is widely used for sequence-to-sequence tasks, largely because self-attention connects any two positions in a sequence directly and so captures long-range dependencies.

## Encoder Layers


The Encoder Layer is the building block of the Transformer Encoder and plays a crucial role in processing input sequences:

**Encoder Layer in Transformer:**

**Purpose:**
- The Encoder Layer in the Transformer is responsible for processing the input sequence. It applies multi-head self-attention and feedforward neural networks to capture dependencies and relationships between elements in the sequence.

**Key Components:**
1. **Multi-Head Self-Attention:** This mechanism allows the model to weigh the importance of different elements in the input sequence when making predictions for a particular element. It captures dependencies and relationships between words in a sentence.

2. **Position-wise Feedforward Networks:** After self-attention, the model applies the same feedforward neural network to each position in the sequence independently. It does not mix information between positions (that is the job of attention); instead it adds a non-linear transformation of each position's representation.

3. **Layer Normalization:** Normalization techniques are applied to stabilize the training process.

4. **Residual Connections:** Residual connections (skip connections) are used to mitigate the vanishing gradient problem and enable the training of very deep networks.
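The "position-wise" property of the feedforward network can be checked directly. The sketch below, with hypothetical sizes, applies the same small network to a whole sequence at once and to each position separately, and compares the results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Feedforward network: d_model -> dim_feedforward -> d_model (sizes are assumptions)
ffn = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

x = torch.randn(5, 2, 16)  # (seq_len, batch, d_model)

# Applied to the whole sequence in one call
full = ffn(x)
# Applied to each position separately, then stacked back together
per_position = torch.stack([ffn(x[t]) for t in range(x.size(0))])

# The two agree up to floating-point error: no information crosses positions
same = torch.allclose(full, per_position, atol=1e-5)
print(same)
```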

**Mathematics:**
- The Encoder Layer in the Transformer can be mathematically represented as follows:
  \[ \text{Output} = \text{LayerNorm}(\text{MultiHeadSelfAttention}(\text{Input}) + \text{Input})\]
  \[ \text{FinalOutput} = \text{LayerNorm}(\text{PositionwiseFeedforward}(\text{Output}) + \text{Output})\]

**Example Code in PyTorch:**
```python
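import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, dropout):
        super().__init__()
        # nn.MultiheadAttention defaults to batch_first=False, so inputs
        # are expected as (seq_len, batch, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        # Position-wise feedforward network: d_model -> dim_feedforward -> d_model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask=None):
        # Multi-Head Self-Attention: query, key, and value are all `src`;
        # [0] keeps the attention output and drops the attention weights
        src2 = self.self_attn(src, src, src, attn_mask=src_mask)[0]
        # Residual connection followed by layer normalization (post-norm)
        src = src + self.dropout(src2)
        src = self.norm1(src)

        # Position-wise Feedforward, applied to each position independently
        src2 = self.linear2(F.relu(self.linear1(src)))
        src = src + self.dropout(src2)
        src = self.norm2(src)

        return src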
```

In the example code above, we define a simplified version of a Transformer Encoder Layer using PyTorch. This layer performs multi-head self-attention followed by position-wise feedforward operations. It also includes normalization and dropout for stability and regularization. This layer can be stacked multiple times to create a full Transformer Encoder.
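To illustrate stacking, the sketch below uses PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` rather than the simplified class above, so that it runs on its own. The sizes and the number of layers are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

# One encoder layer (self-attention + feedforward), using PyTorch's built-in module
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, dim_feedforward=64, dropout=0.1)
# Stack three copies of the layer to form a full encoder
encoder = nn.TransformerEncoder(layer, num_layers=3)
encoder.eval()  # disable dropout for a deterministic forward pass

# Default layout is (seq_len, batch, d_model) because batch_first=False
src = torch.randn(10, 2, 32)
with torch.no_grad():
    memory = encoder(src)

# The encoder returns one vector per input position, same shape as the input
print(memory.shape)  # torch.Size([10, 2, 32])
```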

The Encoder Layer in the Transformer is a fundamental component that allows the model to process input sequences effectively, capturing dependencies and relationships between elements. It is used in a variety of natural language processing tasks, including machine translation and text understanding.

## Decoder Layer

The Decoder Layer is the building block of the Transformer Decoder and is responsible for generating output sequences based on the encoded input:

**Decoder Layer in Transformer:**

**Purpose:**
- The Decoder Layer in the Transformer is responsible for generating output sequences, such as translations in machine translation tasks. It applies masked multi-head self-attention over the output sequence and cross-attention over the encoded input, so each output position depends on the earlier outputs and on the input sequence.

**Key Components:**
1. **Multi-Head Self-Attention:** Similar to the Encoder Layer, the Decoder Layer uses multi-head self-attention to capture dependencies within the output sequence. Here the attention is masked, so each position attends only to itself and to earlier positions in the output when making a prediction.

2. **Multi-Head Cross-Attention:** In addition to self-attention, the Decoder Layer applies multi-head cross-attention to the encoded input sequence: the queries come from the decoder, while the keys and values come from the encoder's output. This mechanism captures dependencies between the output and input sequences, enabling the model to generate contextually relevant output.

3. **Position-wise Feedforward Networks:** After self-attention and cross-attention, the model applies the same feedforward neural network to each position in the sequence independently. It does not mix information between positions; it adds a non-linear transformation of each position's representation.

4. **Layer Normalization:** Normalization techniques are applied to stabilize the training process.

5. **Residual Connections:** Residual connections (skip connections) are used to mitigate the vanishing gradient problem and enable the training of very deep networks.

**Mathematics:**
- The Decoder Layer in the Transformer can be mathematically represented as follows:
  \[ \text{SelfAttnOutput} = \text{LayerNorm}(\text{MaskedMultiHeadSelfAttention}(\text{Input}) + \text{Input})\]
  \[ \text{CrossAttnOutput} = \text{LayerNorm}(\text{MultiHeadCrossAttention}(\text{SelfAttnOutput}, \text{EncodedInput}) + \text{SelfAttnOutput})\]
  \[ \text{FinalOutput} = \text{LayerNorm}(\text{PositionwiseFeedforward}(\text{CrossAttnOutput}) + \text{CrossAttnOutput})\]

**Example Code in PyTorch:**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward, dropout):
        super(TransformerDecoderLayer, self).__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
        # Multi-Head Self-Attention
        tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0]
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm1(tgt)

        # Multi-Head Cross-Attention with the encoded input
        tgt2 = self.cross_attn(tgt, memory, memory, attn_mask=memory_mask)[0]
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm2(tgt)

        # Position-wise Feedforward
        tgt2 = self.linear2(F.relu(self.linear1(tgt)))
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm3(tgt)

        return tgt
```

In the example code above, we define a simplified version of a Transformer Decoder Layer using PyTorch. This layer performs multi-head self-attention, multi-head cross-attention with the encoded input, and position-wise feedforward operations. It also includes normalization and dropout for stability and regularization. This layer can be stacked multiple times to create a full Transformer Decoder.
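As a self-contained illustration of stacking and of the causal mask, the sketch below uses PyTorch's built-in `nn.TransformerDecoderLayer` and `nn.TransformerDecoder` instead of the simplified class above. All sizes are assumptions, and the random `memory` tensor stands in for a real encoder output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, dim_feedforward=64, dropout=0.1)
# Stack two decoder layers to form a full decoder
decoder = nn.TransformerDecoder(layer, num_layers=2)
decoder.eval()  # disable dropout for a deterministic forward pass

memory = torch.randn(10, 2, d_model)  # stand-in encoder output: (src_len, batch, d_model)
tgt = torch.randn(7, 2, d_model)      # target embeddings: (tgt_len, batch, d_model)
# Causal mask: -inf above the diagonal blocks attention to future positions
tgt_mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)

with torch.no_grad():
    out = decoder(tgt, memory, tgt_mask=tgt_mask)
    # Changing only the last target position should leave earlier outputs unchanged
    tgt_changed = tgt.clone()
    tgt_changed[-1] += 1.0
    out_changed = decoder(tgt_changed, memory, tgt_mask=tgt_mask)

print(out.shape)
print(torch.allclose(out[:-1], out_changed[:-1], atol=1e-5))
```

The second check is what the mask buys: outputs at earlier positions do not depend on later target tokens, which is what allows training on all positions in parallel.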

The Decoder Layer in the Transformer is a critical component for generating output sequences based on the encoded input. It is used in various sequence-to-sequence tasks, including machine translation, text generation, and language modeling.