# Encoder-Only Models

Encoder-only architectures are a subset of transformer models designed solely for understanding or encoding input data without generating new text. These models are used when the goal is to analyze or classify text rather than generate responses.

### Common Applications: 
  - **Text Classification:**  
    Ideal for tasks like sentiment analysis, topic categorization, and spam detection, where the model assigns a label or score to the input text.
  - **Sequence Labeling:**  
    Useful for tasks where each token in the text must be labeled (e.g., Named Entity Recognition, Part-of-Speech Tagging). The model outputs a label for every token based on its contextualized representation.

![](https://miro.medium.com/v2/resize:fit:680/1*fNQ9_cJ0Jo78U01nYz-Hcg.jpeg)

## How They Work

### Transformers as Feature Extractors

#### **Tokenization and Embedding:**  
  Input text is first tokenized into individual tokens (words or subwords). Each token is then mapped to a dense vector using an embedding layer. These embeddings serve as the starting point, capturing basic semantic information about the tokens.
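
The subword splitting mentioned above can be sketched as a greedy longest-match tokenizer in the style of WordPiece. This is a minimal illustration, not the tokenizer shipped with any particular model: the vocabulary `TOY_VOCAB` and the helper names are made up for the example, and real tokenizers also handle punctuation, casing rules, and much larger vocabularies.

```python
# Toy vocabulary (hypothetical). "##" marks a piece that continues a word.
TOY_VOCAB = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "the": 4, "game": 5}

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Split on whitespace, tokenize each word, and map pieces to integer ids."""
    tokens = [p for w in text.lower().split() for p in wordpiece_tokenize(w, vocab)]
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("Playing the game", TOY_VOCAB)
print(tokens)  # ['play', '##ing', 'the', 'game']
print(ids)     # [1, 2, 4, 5]
```

The integer ids are what the embedding layer uses as row indices into its lookup table.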

#### **Stack of Transformer Encoder Layers:**  
  The embedded tokens are passed through a series of transformer encoder layers. Each layer consists of several key components:
  - **Self-Attention Mechanism:**  
    Every token in the sequence is allowed to attend to every other token. This mechanism computes attention scores that determine how much influence each token should have on the others, thereby integrating context from the entire sequence.
  - **Feed-Forward Networks:**  
    After the self-attention step, the output is processed by a fully-connected feed-forward network. This introduces non-linear transformations and helps in learning higher-level representations.
  - **Normalization and Residual Connections:**  
    Layer normalization and residual connections are applied around both the self-attention and feed-forward sublayers. These techniques help stabilize training, maintain information flow, and mitigate the vanishing gradient problem.
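
A minimal sketch of the self-attention step described above, written in plain Python so the arithmetic stays visible. It assumes identity projections (queries, keys, and values are all the input vectors themselves) and a single head; real encoder layers learn separate projection matrices and use several heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption of this sketch)."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Score this token's query against every token's key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # The new representation is the attention-weighted average of all value vectors.
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out, attn = self_attention(x)
for row in attn:
    print([round(a, 3) for a in row])
```

Each row of `attn` sums to 1 and says how strongly one token draws on every token in the sequence, including itself.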

#### **Contextualized Representations:**  
  As tokens pass through successive encoder layers, their representations become highly contextualized. This means each token’s final high-dimensional embedding encodes not only its own meaning but also its relationships and dependencies with every other token in the sequence.

### Classification via the CLS Token

#### **Introducing the CLS Token:**  
  For tasks that require a single output for the entire sequence (such as text classification), a special token known as the **CLS token** is inserted at the beginning of the input sequence (CLS for Classification). This token does not represent actual content but serves as a placeholder for aggregated information.

#### **Aggregating Information:**  
  As the sequence flows through the transformer layers, the CLS token interacts with all other tokens via self-attention. Its final hidden state effectively becomes a summary representation of the entire input sequence.

#### **Classifier Head:**  
  The summary representation captured by the CLS token is then fed into a classifier—often a simple feed-forward network or a couple of layers followed by a softmax activation—to produce the final prediction, such as a sentiment label or topic category.
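
The classifier head can be sketched as a single linear layer followed by a softmax, applied only to the first position of the encoder output. The hidden states, weights, and bias below are invented numbers for illustration; in a trained model they would be learned.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_hidden, weight, bias):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [sum(w * h for w, h in zip(row, cls_hidden)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)

# Hypothetical final hidden states for "[CLS] great movie [SEP]" with hidden size 3.
hidden_states = [
    [0.5, -1.0, 2.0],   # [CLS]
    [0.1, 0.3, 0.2],    # great
    [0.9, 0.4, -0.3],   # movie
    [0.0, 0.1, 0.0],    # [SEP]
]
cls_hidden = hidden_states[0]                  # only position 0 feeds the classifier
weight = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]  # 2 classes x hidden size 3
bias = [0.0, 0.0]

probs = classify(cls_hidden, weight, bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```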

![](https://miro.medium.com/v2/resize:fit:1400/0*O4yM2Kis_k2S5M40.png)

### Sequence Labeling

#### **Token-Level Predictions:**  
  For tasks requiring predictions for each token (like Named Entity Recognition or Part-of-Speech Tagging), the model uses the final hidden state of each token instead of just the CLS token. Each token’s contextualized embedding is passed through a classification layer to predict labels at the token level.

#### **Enhancing Predictions with Structured Techniques:**  
  In some sequence labeling tasks, additional structured prediction layers such as Conditional Random Fields (CRFs) are applied on top of the token-level outputs. This helps capture the dependencies between labels, ensuring that the sequence of predictions is coherent and contextually valid.
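
To make the role of a CRF-style layer concrete, here is a small Viterbi decoder that combines per-token scores with label-to-label transition scores. The label set, emission scores, and transition scores are made-up values; a real CRF layer learns the transitions during training.

```python
def viterbi(emissions, transitions):
    """emissions[t][y]: score of label y at token t; transitions[a][b]: score of a -> b."""
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n_labels):
            # Best previous label for ending in label y at step t.
            best_prev = max(range(n_labels), key=lambda a: score[a] + transitions[a][y])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y] + emissions[t][y])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

LABELS = ["O", "B-PER", "I-PER"]
emissions = [
    [2.0, 0.5, 0.0],  # "met"
    [0.2, 1.0, 1.2],  # "Ada"      (per-token argmax says I-PER)
    [0.1, 0.3, 1.5],  # "Lovelace"
]
# Only one constraint in this toy example: O -> I-PER is heavily penalized.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=row.__getitem__) for row in emissions]
path = viterbi(emissions, transitions)
print([LABELS[i] for i in greedy])  # ['O', 'I-PER', 'I-PER']
print([LABELS[i] for i in path])    # ['O', 'B-PER', 'I-PER']
```

Picking the best label per token independently yields an entity that starts with `I-PER`, while the joint decode repairs it to `B-PER` followed by `I-PER`.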


# BERT (Bidirectional Encoder Representations from Transformers)

BERT is one of the most well-known encoder-only transformer models. It leverages a deep bidirectional architecture to generate powerful language representations. 

![](BERT.png)


## BERT Architecture

- **Encoder-Only Design:**  
  BERT is built entirely using transformer encoder layers. Unlike models that include both an encoder and a decoder (e.g., T5 or the original Transformer), BERT is solely focused on understanding input text. This specialization makes it highly effective for tasks like text classification, sentiment analysis, and question answering, where the goal is to derive meaning from the input rather than generate new text.

#### **Bidirectionality:**  
  A hallmark of BERT is its bidirectional self-attention mechanism.  
  - **How It Works:**  
    During training, every token in the input sequence attends to all other tokens simultaneously, both to its left and right. This allows the model to capture context from both directions.  
  - **Significance:**  
    This bidirectional approach enables BERT to develop a richer understanding of language, as it can capture nuances that arise from the full context of a sentence. It contrasts with traditional language models that only process text in one direction (left-to-right or right-to-left).

#### **Input Representations:**  
  BERT creates a comprehensive input representation by summing three types of embeddings:
  - **Token Embeddings:**  
    Each token (word or subword) is mapped to a dense vector that encodes its basic semantic properties.
  - **Position Embeddings:**  
    Since transformers lack an inherent sense of order, position embeddings are added to each token to encode its position in the sequence, ensuring that the order of words is preserved.
  - **Segment Embeddings:**  
    When processing tasks that involve pairs of sentences (e.g., next sentence prediction), segment embeddings are used to distinguish tokens belonging to different sentences. This helps the model understand the relationship between segments within the same input.
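
As a rough illustration of how a sentence pair is laid out before the three embeddings are looked up and summed, the sketch below builds the token sequence together with its segment and position ids. The helper name `build_pair_input` is hypothetical.

```python
def build_pair_input(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with matching id lists."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Positions simply count from 0 across the whole sequence.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_pair_input(
    ["how", "are", "you"], ["fine", "thanks"]
)
print(tokens)        # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Each of the three id lists indexes its own embedding table, and the resulting vectors are added element-wise per position.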


## BERT Pre-Training

BERT is first pre-trained on large volumes of unlabeled text, which enables it to learn robust and generalizable language representations. The original BERT uses two pre-training objectives:

#### **Masked Language Modeling (MLM):**  
  - **Concept:**  
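    A percentage (typically about 15%) of the input tokens is randomly selected for prediction. In the original BERT recipe, most of the selected tokens (80%) are replaced with a special `[MASK]` token, 10% are replaced with a random token, and 10% are left unchanged.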
  - **Task:**  
    The model must predict the original token for each masked position using the context provided by the surrounding tokens.
  - **Purpose:**  
    This encourages the model to learn deep, bidirectional representations of language, as it must infer missing words based on both preceding and following context.
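
The masking step can be sketched as follows. This is a simplified version: it decides independently for each token whether to select it (rather than selecting a fixed number of positions), and it follows the 80% `[MASK]` / 10% random token / 10% unchanged split from the original BERT recipe. The function name and the tiny replacement vocabulary are illustrative only.

```python
import random

def mask_tokens(tokens, vocab_words, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels[i] is the original token or None."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never mask special tokens; otherwise select with probability mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict this original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"
        elif roll < 0.9:
            masked[i] = rng.choice(vocab_words)
        # else: keep the original token in place
    return masked, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab_words=["dog", "ran", "tree"])
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`.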

#### **Next Sentence Prediction (NSP):**  
  - **Concept:**  
    BERT is presented with pairs of sentences, where some pairs are consecutive sentences from the original text, while others are randomly paired.
  - **Task:**  
    The model must predict whether the second sentence logically follows the first.
  - **Purpose:**  
    NSP helps BERT capture inter-sentence relationships, which is beneficial for tasks that involve understanding the coherence and logical flow of text, such as question answering and natural language inference.
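
A minimal sketch of how NSP training pairs might be assembled from an ordered list of sentences. Two simplifications are assumed here: negatives are drawn from the same list rather than from a different document, and the label convention (1 for a true next sentence, 0 for a random one) is chosen for this example only.

```python
import random

def make_nsp_examples(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) triples from consecutive sentences."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the actual next sentence.
            examples.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence other than A itself or the true next one.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), 0))
    return examples

sentences = [
    "The storm arrived at dusk.",
    "Rain hammered the roof all night.",
    "By morning the river had risen.",
    "The bridge was closed to traffic.",
]
examples = make_nsp_examples(sentences)
for a, b, label in examples:
    print(label, "|", a, "->", b)
```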

#### **Outcome of Pre-Training:**  
  Through MLM and NSP, BERT develops a rich understanding of language. It learns syntax, semantics, and even some world knowledge, which forms a foundation that can be fine-tuned for various tasks with minimal additional data.


## BERT Fine-Tuning

After pre-training, BERT is adapted to specific downstream tasks via fine-tuning, where the model's general language understanding is tailored to the particular requirements of the task.

#### **Task-Specific Adaptation:**  
  Fine-tuning involves adding a small, task-specific layer on top of the pre-trained BERT model.  
  - **Example:**  
    For a text classification task, the representation corresponding to the `[CLS]` token is passed through a softmax layer to produce a probability distribution over class labels.

#### **Minimal Data Requirements:**  
  Because BERT has already learned rich representations during pre-training, fine-tuning on downstream tasks often requires only a small amount of labeled data. This makes BERT highly efficient in scenarios where labeled data is limited.

#### **Training Process:**  
  - **Input Format:**  
    The format remains similar to the pre-training phase, with inputs starting with a `[CLS]` token (and using `[SEP]` tokens to delimit sentences when needed).
  - **Optimization:**  
    During fine-tuning, the entire model—including the pre-trained layers and the new task-specific layers—is trained jointly. Optimization is typically performed using gradient descent-based methods like Adam or AdamW.
  - **Learning Rate and Epochs:**  
    Fine-tuning generally employs a smaller learning rate to prevent large updates that could erase the valuable language representations acquired during pre-training. The process usually runs for only a few epochs, ensuring the model adapts to the task without overfitting.

This comprehensive process—from encoder-only architecture and bidirectional context understanding to sophisticated pre-training and efficient fine-tuning—forms the backbone of BERT’s success in a wide range of natural language processing tasks.

![](BERT_Comps.png)

---

# Building BERT Architecture from Scratch 

Below is a simplified implementation of a BERT-like model using PyTorch. 

This includes:
- Embedding layers (token, position, and segment embeddings)
- A stack of Transformer encoder layers to create contextualized representations
- A classification head that uses the [CLS] token representation for sequence classification

**Note:** This is a minimal example for educational purposes and does not include all optimizations of the original BERT.

In [None]:
!pip install torchinfo

In [24]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchinfo import summary

#### Building The Embeddings

In [25]:
class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size, hidden_size, max_position_embeddings, type_vocab_size):
        super(BertEmbeddings, self).__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        
        words_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        
        embeddings = words_embeddings + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

#### Building The BERT Model

In [26]:
class BertModel(nn.Module):
    def __init__(self, vocab_size, hidden_size=768, num_hidden_layers=12, 
                 num_attention_heads=12, intermediate_size=3072, max_position_embeddings=512, 
                 type_vocab_size=2):
        super(BertModel, self).__init__()
        self.embeddings = BertEmbeddings(vocab_size, hidden_size, max_position_embeddings, type_vocab_size)
        
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, 
            nhead=num_attention_heads, 
            dim_feedforward=intermediate_size, 
            dropout=0.1,
            activation='gelu'
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_hidden_layers)
    
    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        embeddings = self.embeddings(input_ids, token_type_ids)
        encoder_input = embeddings.transpose(0, 1)
        
        if attention_mask is not None:
            encoder_output = self.encoder(encoder_input, src_key_padding_mask=(attention_mask == 0))
        else:
            encoder_output = self.encoder(encoder_input)
        
        encoder_output = encoder_output.transpose(0, 1)
        return encoder_output

#### Sequence Classification Head

In [27]:
class BertForSequenceClassification(nn.Module):
    def __init__(self, bert_model, num_labels):
        super(BertForSequenceClassification, self).__init__()
        self.bert = bert_model
        self.classifier = nn.Linear(bert_model.embeddings.word_embeddings.embedding_dim, num_labels)
    
    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        sequence_output = self.bert(input_ids, token_type_ids, attention_mask)
        cls_token = sequence_output[:, 0, :]
        logits = self.classifier(cls_token)
        return logits


#### Using The Model

In [29]:
if __name__ == "__main__":
    # Mini fake vocab
    vocab = {
        "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
        "i": 2001, "love": 2002, "this": 2003, "!": 2004,
        "was": 2005, "terrible": 2006
    }

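    def tokenize(sentence, vocab, max_len):
        # Split "!" off as its own token so that "this!" becomes "this", "!"
        words = sentence.lower().replace("!", " !").split()
        tokens = ["[CLS]"] + words + ["[SEP]"]
        # This toy vocab has no [UNK], so unknown tokens fall back to id 0 ([PAD])
        token_ids = [vocab.get(t, 0) for t in tokens]
        padding = [vocab["[PAD]"]] * (max_len - len(token_ids))
        return token_ids + padding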

    # Input examples
    sentences = ["I love this!", "This was terrible"]
    labels = [1, 0]  # 1 = positive, 0 = negative

    # Preprocess
    max_len = 8
    input_ids = torch.tensor([tokenize(s, vocab, max_len) for s in sentences])
    token_type_ids = torch.zeros_like(input_ids)
    attention_mask = (input_ids != vocab["[PAD]"]).long()
    actual_labels = torch.tensor(labels)

    # Hyperparameters
    VOCAB_SIZE = max(vocab.values()) + 1
    HIDDEN_SIZE = 128  # smaller for demo
    MAX_POSITION_EMBEDDINGS = 512
    TYPE_VOCAB_SIZE = 2
    NUM_LABELS = 2
    BATCH_SIZE = len(sentences)
    SEQ_LENGTH = max_len

    # Initialize model
    bert_model = BertModel(
        vocab_size=VOCAB_SIZE,
        hidden_size=HIDDEN_SIZE,
        num_hidden_layers=2,
        num_attention_heads=4,
        intermediate_size=256,
        max_position_embeddings=MAX_POSITION_EMBEDDINGS,
        type_vocab_size=TYPE_VOCAB_SIZE
    )
    model = BertForSequenceClassification(bert_model, num_labels=NUM_LABELS)

    # Print model summary
    print("Model Summary:")
    print(summary(model, input_data=(input_ids, token_type_ids, attention_mask)))

    # Run forward pass
    logits = model(input_ids, token_type_ids, attention_mask)
    probs = F.softmax(logits, dim=1)
    predicted_labels = torch.argmax(probs, dim=1)

    # Output results
    print("\nInput Sentences:")
    for s in sentences:
        print(f'  "{s}"')

    print("\nActual Labels:   ", actual_labels.tolist())
    print("Predicted Labels:", predicted_labels.tolist())

Model Summary:
Layer (type:depth-idx)                             Output Shape              Param #
BertForSequenceClassification                      [2, 2]                    --
├─BertModel: 1-1                                   [2, 8, 128]               --
│    └─BertEmbeddings: 2-1                         [2, 8, 128]               --
│    │    └─Embedding: 3-1                         [2, 8, 128]               256,896
│    │    └─Embedding: 3-2                         [2, 8, 128]               65,536
│    │    └─Embedding: 3-3                         [2, 8, 128]               256
│    │    └─LayerNorm: 3-4                         [2, 8, 128]               256
│    │    └─Dropout: 3-5                           [2, 8, 128]               --
│    └─TransformerEncoder: 2-2                     [8, 2, 128]               --
│    │    └─ModuleList: 3-6                        --                        264,960
├─Linear: 1-2                                      [2, 2]                    258
Tot

## Using a Pre-Trained BERT Model Off The Shelf 

This first code example uses the bert-base-uncased model from Hugging Face. This version of BERT is pretrained on a large corpus of English text and includes the base transformer architecture and embeddings. However, its classification head is not fine-tuned for any specific task—in this case, sentiment analysis—so the weights of the final layer are randomly initialized. As a result, while the model structure is correct and capable, its predictions will not be meaningful until it is trained on labeled data for the downstream task.

In [30]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torchinfo import summary
import torch.nn.functional as F

# Load pretrained tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Sample inputs and true labels
sentences = ["I love this!", "This was terrible"]
actual_labels = [1, 0]  # 1 = positive, 0 = negative

# Tokenize sentences for BERT
inputs = tokenizer(
    sentences,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=16
)

# Show model summary
# Pass the tokenizer output as a dict so each tensor is matched to the right forward() argument by name
# (positionally, BERT's forward() expects attention_mask before token_type_ids)
print("\nModel Summary:\n")
print(summary(model, input_data=dict(inputs)))

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=1)
    predicted_labels = torch.argmax(probs, dim=1)

# Print results
print("\nInput Sentences:")
for s in sentences:
    print(f'  "{s}"')

print("\nActual Labels:   ", actual_labels)
print("Predicted Labels:", predicted_labels.tolist())

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model Summary:

Layer (type:depth-idx)                                  Output Shape              Param #
BertForSequenceClassification                           [2, 2]                    --
├─BertModel: 1-1                                        [2, 768]                  --
│    └─BertEmbeddings: 2-1                              [2, 6, 768]               --
│    │    └─Embedding: 3-1                              [2, 6, 768]               23,440,896
│    │    └─Embedding: 3-2                              [2, 6, 768]               1,536
│    │    └─Embedding: 3-3                              [1, 6, 768]               393,216
│    │    └─LayerNorm: 3-4                              [2, 6, 768]               1,536
│    │    └─Dropout: 3-5                                [2, 6, 768]               --
│    └─BertEncoder: 2-2                                 [2, 6, 768]               --
│    │    └─ModuleList: 3-6                             --                        85,054,464
│    └─BertPoole

## Using a Fine-Tuned Pre-Trained BERT Model

This second example uses distilbert-base-uncased-finetuned-sst-2-english, a DistilBERT model (a smaller, distilled version of BERT) that has already been fine-tuned on the SST-2 dataset for binary sentiment classification. Because the entire model, including the classification head, has been trained for this specific task, it can provide meaningful sentiment predictions immediately. This example demonstrates how a fine-tuned model can be used directly for inference without additional training or customization.

In [31]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torchinfo import summary

# Load fine-tuned model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Input sentences and true labels
sentences = ["I love this!", "This was terrible"]
actual_labels = [1, 0]  # 1 = positive, 0 = negative

# Tokenize inputs
inputs = tokenizer(
    sentences,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=16
)

# Print model summary
print("\nModel Summary:\n")
print(summary(model, input_data=(inputs["input_ids"], inputs["attention_mask"])))

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = F.softmax(logits, dim=1)
    predicted_labels = torch.argmax(probs, dim=1)

# Print results
print("\nInput Sentences:")
for s in sentences:
    print(f'  "{s}"')

print("\nActual Labels:   ", actual_labels)
print("Predicted Labels:", predicted_labels.tolist())



Model Summary:

Layer (type:depth-idx)                                  Output Shape              Param #
DistilBertForSequenceClassification                     [2, 2]                    --
├─DistilBertModel: 1-1                                  [2, 6, 768]               --
│    └─Embeddings: 2-1                                  [2, 6, 768]               --
│    │    └─Embedding: 3-1                              [2, 6, 768]               23,440,896
│    │    └─Embedding: 3-2                              [1, 6, 768]               393,216
│    │    └─LayerNorm: 3-3                              [2, 6, 768]               1,536
│    │    └─Dropout: 3-4                                [2, 6, 768]               --
│    └─Transformer: 2-2                                 [2, 6, 768]               --
│    │    └─ModuleList: 3-5                             --                        42,527,232
├─Linear: 1-2                                           [2, 768]                  590,592
├─Dropout: 1-3

---

# Zero-Shot, Few-Shot, and Full Training

These approaches describe how pre-trained models can be adapted for specific downstream tasks. Depending on the amount of labeled data available and the complexity of the task, you can choose from different strategies: zero-shot, few-shot, and full training. Each method leverages the rich language representations learned during pre-training in different ways.

## Zero-Shot Learning

Zero-shot learning involves using a pre-trained model to make predictions on a new task without any additional fine-tuning. In this scenario, the model relies solely on the general language understanding it gained during pre-training. 

**Definition:**  
  The model is directly applied to a downstream task without any task-specific training.

**Use Case:**  
  This approach is particularly useful when no labeled data is available for the new task. For instance, you might want to classify text in a domain that was not present during the model's pre-training.

**Limitations:**  
  Since the model hasn't been tailored to the specific nuances of the new task, zero-shot predictions may be less accurate, especially if the task is very different from the data used during pre-training.


## Few-Shot Learning

Few-shot learning strikes a balance between leveraging pre-trained knowledge and adapting to a new task with a minimal amount of labeled data. Here, the model is fine-tuned on a small number of examples, which allows it to quickly learn the characteristics of the target task.

**Definition:**  
  The pre-trained model is fine-tuned using a very small number of examples from the target task.

**Use Case:**  
  This method is beneficial when labeled data is scarce or expensive to obtain. It allows the model to quickly adapt to a new task without the need for a large training set.

**Benefits:**  
  Due to the strong representations learned during pre-training, even a few examples can be enough to achieve reasonable performance on the target task.

**Challenges:**  
  The success of few-shot learning can be highly sensitive to the quality and representativeness of the few examples provided. If the examples are not well-chosen, the model might not generalize effectively to unseen data.


## Full Training (Fine-Tuning with Extensive Data)

Full training involves fine-tuning the pre-trained model on a large, task-specific dataset. This approach is typically used when abundant labeled data is available and the goal is to achieve the highest possible performance on the target task.

**Definition:**  
  The pre-trained model is further trained (fine-tuned) on a comprehensive dataset specific to the task at hand.

**Use Case:**  
  When ample labeled data is available, full training allows the model to capture the fine-grained nuances and patterns that are particular to the task, often resulting in state-of-the-art performance.

**Benefits:**  
  Extensive fine-tuning can significantly improve accuracy by aligning the model's representations more closely with the target task's requirements.

**Trade-Off:**  
  This approach demands more computational resources and carries a higher risk of overfitting if the training process is not carefully managed. It requires a balance between leveraging the pre-trained knowledge and adapting to the specific patterns in the labeled data.

---

# Decoder-Only Transformers

Decoder-only transformers are designed specifically for generating output sequences. They are the backbone of models used for text generation and text completion. Unlike encoder-only models, which focus on understanding and processing input text, decoder-only architectures are autoregressive, meaning they generate text one token at a time based solely on the tokens that have come before. A well-known example of this type of architecture is the GPT (Generative Pre-Trained Transformer) series.
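
The "one token at a time" loop can be sketched with a toy stand-in for the model. The lookup table `BIGRAM_PROBS` is invented for illustration and only looks at the last token, whereas a real decoder conditions on the entire prefix; the loop structure is the part that carries over.

```python
# Hypothetical next-token probabilities keyed by the previous token.
BIGRAM_PROBS = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}

def next_token_distribution(prefix):
    # Stand-in for a model forward pass over the prefix generated so far.
    return BIGRAM_PROBS.get(prefix[-1], {"<eos>": 1.0})

def generate_greedy(prompt, max_new_tokens=10):
    """Repeatedly append the most likely next token until <eos> or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        next_token = max(dist, key=dist.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the new token becomes part of the next step's input
    return tokens

print(generate_greedy(["<bos>"]))  # ['<bos>', 'the', 'cat', 'sat']
```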

![](Decoder_Only.png)

## Core Characteristics and Training

At the heart of a decoder-only transformer is its ability to generate text by predicting the next token in a sequence. This process involves a few key modifications compared to the bidirectional models like BERT:

- **Autoregressive Generation:**  
  In decoder-only models, the prediction of each token depends only on the tokens that have already been generated. This is in contrast to models like BERT, which use bidirectional context and can attend to tokens both before and after a given position.

- **Causal Masking:**  
  A crucial component of training a decoder-only transformer is the implementation of causal masking in the self-attention mechanism.  
  - **How It Works:**  
    During training, the model computes attention scores between all pairs of tokens. To ensure that a token only considers previous tokens (and not any future tokens), the upper triangular portion of the attention score matrix is set to negative infinity.  
  - **Impact of Masking:**  
    When the softmax function is applied to these scores, the masked positions effectively contribute a weight of zero. This prevents the model from "cheating" by looking ahead in the sequence, thereby enforcing the autoregressive property.

- **Parallelization:**  
  Despite being autoregressive, the transformer architecture allows for parallel processing during training. Unlike RNNs, which process tokens sequentially, all tokens in a sequence can be processed simultaneously because the causal mask is applied uniformly. This parallelized computation reduces training time and makes it feasible to train very large models on extensive datasets.
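
The effect of the causal mask can be checked with a few lines of plain Python. The sketch below takes a square matrix of raw attention scores (all zeros here, chosen so the result is easy to verify by hand), sets every entry above the diagonal to negative infinity, and applies a row-wise softmax.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions with -inf, then apply a row-wise softmax."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # uniform raw scores for a 4-token sequence
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.3333333333333333, 0.3333333333333333, 0.3333333333333333, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

All rows are computed from the same score matrix in one pass, which is what allows every position to be trained in parallel.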
  

## Advantages Over RNN-Based Models

Decoder-only transformers offer several significant benefits compared to traditional RNN-based language models:

- **Parallel Training:**  
  The self-attention mechanism allows the model to compute representations for all tokens concurrently, which dramatically speeds up training and makes it scalable to larger datasets and models.
  
- **Stable Training Dynamics:**  
  With features like layer normalization and residual connections, transformer architectures are less prone to the common pitfalls of RNNs, such as vanishing or exploding gradients.
  
- **Short Effective Path Length:**  
  In transformer models, every token in the sequence is directly connected to every other token within a single layer, meaning that the effective path length—the number of computational steps needed for information to travel from one token to any other—is always 1. 
  

## GPT Models

The GPT family is the most prominent example of decoder-only transformers. Generative Pre-Trained Transformers (GPT) are specifically designed for tasks that require text generation, such as:

- **Text Generation and Completion:**  
  GPT models generate coherent and contextually relevant text by predicting one token at a time.
  
- **Conversational AI and Chatbots:**  
  Their ability to generate human-like text makes them suitable for building interactive conversational systems.

- **Language Modeling:**  
  They serve as powerful language models capable of understanding and generating natural language.

Examples in the GPT series include GPT-1, GPT-2, and GPT-3. More recent models like GPT-3.5 and GPT-4 are proprietary, and their architectural details have not been fully published, but they are understood to build upon the core principles of the autoregressive, decoder-only approach.

## GPT Model with Hugging Face

In this example, we use GPT-2 (a fully open-source GPT model) to generate text. This snippet loads the model and tokenizer, encodes a prompt, and generates text with some sampling options.

In [4]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2"  # You can also experiment with other models like 'EleutherAI/gpt-neo-125M'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Define a prompt for text generation
prompt = "Once upon a time in a land far away"
inputs = tokenizer.encode(prompt, return_tensors="pt")

# Generate text using sampling (do_sample=True) for diversity
output = model.generate(
    inputs,
    max_length=100,         # maximum length of the generated sequence
    do_sample=True,         # use sampling instead of greedy decoding
    top_k=50,               # limit the number of highest probability tokens to consider
    top_p=0.95,             # use nucleus sampling to consider tokens with cumulative probability 0.95
    num_return_sequences=1  # generate one sequence
)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:\n", generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
 Once upon a time in a land far away, where the sea has come out of the sea, and the mountains have gone down to earth for a long time, where the air has not breathed. The seas and the land are like the wind; and the air is the wind. I say to you that my dream has not an end, and I will come again. Let them come to me, and be glad to hear you.

CHAPTER XIII


O n the sea, I
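
The `top_k` and `top_p` options used in the cell above can be illustrated on a toy next-token distribution. This is a sketch of the general idea, not the library's implementation: keep the `top_k` most likely tokens, renormalize, then keep the smallest prefix of that ranking whose cumulative probability reaches `top_p`, and sample from what remains. The probabilities are invented.

```python
import random

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """probs: dict of token -> probability. Returns a renormalized dict of kept tokens."""
    # Step 1 (top-k): keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total_k = sum(p for _, p in ranked)
    # Step 2 (top-p): keep the shortest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total_k
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample(probs, rng):
    """Draw one token according to the filtered distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

next_token_probs = {"away": 0.5, "ago": 0.3, "and": 0.15, "zebra": 0.05}
filtered = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.75)
print(filtered)  # only "away" and "ago" remain
print(sample(filtered, random.Random(0)))
```

Lower values of `top_k` or `top_p` make the output more predictable; higher values allow rarer tokens and more varied text.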


## GPT-4o mini with OpenAI

This example demonstrates how to use the OpenAI Python API to interact with GPT-4o mini. The example sends a conversation to the model and prints out the response. You will need an API key for this to work, which can be obtained from https://platform.openai.com/docs/overview

In [5]:
from openai import OpenAI

# Initialize the OpenAI client (api_key is assumed to be defined in an earlier cell)
client = OpenAI(api_key=api_key)

# Make a request to GPT-4o mini
response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages=[{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": "Tell me a joke."}]
)

# Print the response
print(response.choices[0].message.content)

Why did the scarecrow win an award?

Because he was outstanding in his field!


---

# Encoder-Decoder Transformers

Encoder-decoder transformers, originally introduced in "Attention Is All You Need", are designed for tasks where an input sequence is transformed into a different output sequence. This architecture is particularly effective for sequence-to-sequence tasks such as machine translation and summarization. In these models, the encoder processes the input text into a rich, context-aware representation, while the decoder generates the output text based on this representation and the previously generated tokens.

The overall process is as follows:
- **Encoder:**  
  The encoder takes the input text and processes it through several layers of self-attention and feed-forward networks. This results in a set of high-dimensional representations that capture the contextual relationships between tokens.
  
- **Decoder:**  
  The decoder generates the output text in an autoregressive manner. It uses a combination of masked self-attention (to ensure that each token is generated only based on preceding tokens) and cross-attention (to integrate information from the encoder's output). This allows the model to produce coherent and contextually relevant output sequences.
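
The two kinds of decoder attention described above can be sketched in a few lines. This is a simplified illustration, assuming raw state vectors are used directly as queries and keys (real models apply learned projections first); `attention_weights` and the toy vectors are hypothetical names and values.

```python
import math

def attention_weights(queries, keys, causal=False):
    """Row-wise softmax of scaled dot products; optionally hide future positions."""
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # position i may not attend to position j > i
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
decoder_states = [[0.5, 0.5], [1.0, -1.0]]             # 2 target tokens generated so far

# Masked self-attention: decoder tokens attend only to themselves and earlier tokens.
self_attn = attention_weights(decoder_states, decoder_states, causal=True)
# Cross-attention: decoder tokens (queries) attend to all encoder outputs (keys).
cross_attn = attention_weights(decoder_states, encoder_states)

print(self_attn)
print(cross_attn)
```

The first row of `self_attn` puts no weight on the second target token, while every row of `cross_attn` spreads weight over all three source tokens.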


## T5 (Text-to-Text Transfer Transformer)

T5 takes a unified approach by recasting all NLP tasks as text-to-text problems. This means that no matter what the original task is — be it translation, summarization, or classification — both the input and output are treated as text. This unified framework simplifies the training and fine-tuning process considerably.

- **Unified Framework:**  
  Every task is converted into a format where the model receives text as input and generates text as output. For example, a classification task might involve formatting the input as "Classify: [text]" and expecting a textual label as the output.
  
- **Pre-training and Fine-Tuning:**  
  T5 is pre-trained on a large corpus (C4, the Colossal Clean Crawled Corpus) with an unsupervised span-corruption objective, learning a broad understanding of language. It is then fine-tuned on specific tasks, which allowed it to achieve state-of-the-art results on many benchmarks at the time of its release.
  
- **Benefits:**  
  The text-to-text approach unifies the model architecture across tasks, making it easier to transfer learning from one task to another and reducing the need for separate, task-specific models.
  
![](https://miro.medium.com/v2/resize:fit:1400/0*-MxKkmD7pRHnc0gx.png)
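
The text-to-text framing can be sketched as a small formatting step that turns every task into an (input text, target text) pair. The helper `to_text_to_text` is hypothetical; the translation and summarization prefixes follow the style used by T5, while the `classify: ` prefix and the example texts are assumptions for illustration.

```python
def to_text_to_text(task, text, target):
    """Format a task example as a pair of plain strings."""
    prefixes = {
        "translation": "translate English to German: ",
        "summarization": "summarize: ",
        "classification": "classify: ",  # hypothetical prefix for illustration
    }
    return {"input": prefixes[task] + text, "target": target}

examples = [
    to_text_to_text("translation", "That is good.", "Das ist gut."),
    to_text_to_text("summarization", "A long article about transformers.", "Transformers overview."),
    to_text_to_text("classification", "I loved this movie!", "positive"),
]

for example in examples:
    print(example["input"], "->", example["target"])
```

Because the label `positive` is just another string, the same model and the same loss can be used for classification as for translation or summarization.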


## BART (Bidirectional and Auto-Regressive Transformers)

BART is designed to leverage the strengths of both bidirectional and autoregressive models. It combines a bidirectional encoder, similar to BERT, with an autoregressive decoder, similar to GPT, which makes it highly effective for tasks that require both understanding and generating text.

- **Hybrid Architecture:**  
  BART uses an encoder-decoder structure where the encoder is capable of capturing context from the entire input sequence (bidirectional attention) and the decoder generates output in an autoregressive manner (ensuring each generated token is based only on previous tokens).
  
- **Denoising Autoencoder Objective:**  
  During training, BART is tasked with reconstructing the original text from a deliberately corrupted version. The corruptions include token masking, token deletion, text infilling, sentence permutation, and document rotation. This denoising objective helps the model learn robust representations of text and improves its performance in generating coherent and fluent outputs.
  
- **Applications:**  
  The dual nature of BART—being good at both understanding and generating text—makes it a versatile model for a variety of tasks including summarization, translation, and general text generation.

![](BART.jpg)
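
A denoising training pair can be sketched as follows. This is a simplified version of text infilling, assuming a single span of fixed length (BART samples span lengths randomly); `mask_span` and the example sentence are hypothetical.

```python
import random

def mask_span(tokens, span_length, rng, mask_token="<mask>"):
    """Replace one contiguous span with a single mask token."""
    start = rng.randrange(0, len(tokens) - span_length + 1)
    return tokens[:start] + [mask_token] + tokens[start + span_length:]

original = "the quick brown fox jumps over the lazy dog".split()
corrupted = mask_span(original, span_length=3, rng=random.Random(0))

# The encoder reads the corrupted text; the decoder learns to emit the original.
training_pair = {
    "encoder_input": " ".join(corrupted),
    "decoder_target": " ".join(original),
}
print(training_pair)
```

Because a whole span collapses into one mask token, the model has to infer how many tokens are missing as well as what they are.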

# Challenges and Future Directions

Despite their impressive performance, transformer models face several key challenges that continue to drive research in the field. These challenges include computational issues and scalability, interpretability, and domain adaptation. 

## Scalability

One of the primary computational challenges with transformers is the quadratic growth in the computation of attention scores relative to the sequence length. Since every token attends to every other token, the self-attention mechanism has time and memory complexity of O(n²), where n is the number of tokens. Doubling the sequence length therefore quadruples the number of attention scores, which becomes a significant bottleneck when processing long sequences.

- **Computational Bottleneck:**  
  As the sequence length increases, the number of pairwise comparisons grows rapidly, leading to high memory usage and slower computation. This limits the practicality of applying standard transformers to very long texts or sequences.

- **Approximating Attention Scores:**  
  Researchers are exploring various methods to reduce this computational load:
  - **Sparse Attention Mechanisms:**  
    Instead of calculating attention scores for every token pair, sparse attention restricts the connections to a subset of tokens. This creates a sparser graph of attention relationships, which reduces the overall number of computations.
  - **Efficient Attention Approximations:**  
    Methods such as low-rank approximations, kernel-based techniques, and random feature mappings (e.g., the Performer model) aim to approximate the full softmax attention more efficiently. These techniques can sometimes reduce the complexity from quadratic to linear or near-linear time while still preserving much of the model’s effectiveness.
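
To make the scaling concrete, the sketch below counts attention scores for full attention and for one simple sparse pattern, a sliding window in which each token attends only to nearby tokens. The window size of 64 and the function names are assumptions chosen for illustration.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n scores."""
    return n * n

def sliding_window_pairs(n, window):
    """Each token attends to itself and up to `window` neighbors on each side."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n))

for n in (512, 2048, 8192):
    full = full_attention_pairs(n)
    sparse = sliding_window_pairs(n, window=64)
    print(f"n={n:5d}  full={full:>11,}  window={sparse:>10,}  ratio={full / sparse:.1f}x")
```

The full count grows with n², while the windowed count grows roughly linearly in n for a fixed window, so the gap widens as sequences get longer.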

## Interpretability

Transformers, especially large-scale models with billions of parameters, often function as "black boxes." Understanding the decision-making process and the roles of individual parameters or attention heads is a significant challenge.

- **Opaque Decision-Making:**  
  The internal workings of transformers are complex, making it difficult to trace how specific outputs are generated from a given input. The sheer size of modern models complicates efforts to assign meaning to individual weights or neurons.
  
- **Techniques for Interpretability:**  
  Several methods are being developed to gain insights into transformer behavior:
  - **Visualization of Attention Weights:**  
    By examining the attention weights, researchers can sometimes identify patterns in how the model focuses on different parts of the input. However, these visualizations do not always provide a complete explanation of the model’s decisions.
  - **Attention Rollout and Attribution Methods:**  
    Techniques such as attention rollout and gradient-based attribution methods (for example, integrated gradients) help in understanding the contribution of various tokens and layers. These methods can shed light on which parts of the input are most influential for the model's predictions.
  - **Investigating Multi-Head Attention:**  
    Researchers are also analyzing the roles of different attention heads within multi-head attention. Some heads may focus on syntactic relationships while others capture semantic nuances. This investigation helps in understanding how the model decomposes and processes information, although a complete picture remains elusive.
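
Attention rollout, mentioned above, can be sketched as a product of per-layer attention matrices. This minimal version assumes each layer's attention has already been averaged over heads and mixes in the identity matrix with equal weight to account for residual connections; the two toy matrices are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def attention_rollout(layer_attentions):
    """Combine attention across layers, from the first layer upward."""
    n = len(layer_attentions[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attentions:
        # Average with the identity so the residual path is represented; rows still sum to 1.
        adjusted = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                    for i in range(n)]
        rollout = matmul(adjusted, rollout)
    return rollout

layer1 = [[0.9, 0.1], [0.2, 0.8]]
layer2 = [[0.6, 0.4], [0.5, 0.5]]
rollout = attention_rollout([layer1, layer2])
print(rollout)
```

Row i of the result estimates how much each input token contributes to position i after both layers, which is often more informative than looking at the last layer's attention alone.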

## Domain Adaptation

Domain adaptation refers to the challenge of applying a pre-trained transformer model, which is typically trained on large general corpora, to specialized domains that may have unique terminology or stylistic differences.

- **Generalization Issues:**  
  Models pre-trained on broad datasets might not perform optimally when applied to domain-specific tasks. The language and context in specialized fields (e.g., medical, legal, or technical documents) can differ significantly from general text.
  
- **Strategies for Domain Adaptation:**  
  To address these issues, several approaches are being pursued:
  - **Fine-Tuning on Domain-Specific Data:**  
    One common approach is to fine-tune the pre-trained model on a smaller, domain-specific dataset. This helps the model adjust its representations to better capture the nuances of the target domain.
  - **Domain-Specific Pre-Training:**  
    In some cases, researchers pre-train models from scratch or continue pre-training on domain-specific corpora to create a specialized model.
  - **Adapter Modules:**  
    Adapter modules are lightweight, task-specific layers that can be inserted into a pre-trained model. These adapters allow for efficient domain adaptation without the need to fine-tune the entire model, thus preserving the general knowledge while incorporating domain-specific nuances.
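
A bottleneck adapter can be sketched as a small down-projection, a non-linearity, an up-projection, and a residual connection. The sizes, weights, and zero initialization of the up-projection below are assumptions for illustration; with that initialization the adapter starts out as the identity function, so the pre-trained behavior is preserved before training.

```python
def linear(x, weight, bias):
    """Apply a dense layer; weight is a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def adapter(hidden, down_w, down_b, up_w, up_b):
    """Down-project, apply ReLU, up-project, then add the residual."""
    bottleneck = [max(0.0, v) for v in linear(hidden, down_w, down_b)]
    update = linear(bottleneck, up_w, up_b)
    return [h + u for h, u in zip(hidden, update)]

hidden_size, bottleneck_size = 4, 2
down_w = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]
down_b = [0.0] * bottleneck_size
up_w = [[0.0] * bottleneck_size for _ in range(hidden_size)]  # zero init (assumption)
up_b = [0.0] * hidden_size

hidden = [1.0, 2.0, 3.0, 4.0]
output = adapter(hidden, down_w, down_b, up_w, up_b)

# Only these parameters are trained; the surrounding pre-trained layer stays frozen.
adapter_params = 2 * hidden_size * bottleneck_size + hidden_size + bottleneck_size
print(output, adapter_params)
```

Since the parameter count scales with the bottleneck size instead of the square of the hidden size, a separate adapter can be stored per domain at a small fraction of the cost of a full model copy.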

# Real-World Chatbot Example

In [None]:
!pip install streamlit

In [None]:
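# Save this script as chatbot.py; it is launched later with `streamlit run chatbot.py`
import os

import streamlit as st
from openai import OpenAI

# Set up the OpenAI client; the key is read from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])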

# Streamlit UI
st.title("💬 OpenAI Chatbot")
st.write("A simple chatbot powered by OpenAI.")

# Ensure session state is initialized
if "messages" not in st.session_state:
    st.session_state["messages"] = []

# Display chat history using Streamlit's `st.chat_message`
for msg in st.session_state["messages"]:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

# User input
if user_input := st.chat_input("Type your message..."):
    # Add user input to chat history
    st.session_state["messages"].append({"role": "user", "content": user_input})

    # Display user message in chat interface
    with st.chat_message("user"):
        st.write(user_input)

    # Show a loading indicator while waiting for OpenAI's response
    with st.spinner("Thinking..."):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=st.session_state["messages"]
        )

    # Get AI response
    ai_response = response.choices[0].message.content

    # Add AI response to chat history
    st.session_state["messages"].append({"role": "assistant", "content": ai_response})

    # Display AI response in chat interface
    with st.chat_message("assistant"):
        st.write(ai_response)

In [18]:
!streamlit run chatbot.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://192.168.1.4:8501

  For better performance, install the Watchdog module:

  $ xcode-select --install
  $ pip install watchdog

^C
  Stopping...