# **Improving Natural Language Processing with Attention Mechanisms (Part 2/3)**

In [1]:
from IPython.display import Image

## **Building large-scale language models by leveraging unlabeled data**

One common characteristic of large-scale language models is their ability to leverage vast amounts of unlabeled text data for training. This is typically achieved through self-supervised learning techniques, where the model learns to predict missing words or the next word in a sequence based on the surrounding context. By training on large corpora of text, these models can capture complex linguistic patterns and semantic relationships, enabling them to generate coherent and contextually relevant text.

After pre-training on unlabeled data, these models can be fine-tuned on smaller, labeled datasets for specific downstream tasks such as sentiment analysis, question answering, or machine translation. This two-step process allows the model to benefit from the general language understanding acquired during pre-training while adapting to the nuances of the target task during fine-tuning.

### **Pre-training and fine-tuning transformer models**

- `Self-supervised learning` techniques, such as masked language modeling (MLM) and autoregressive language modeling (ALM), are commonly used for pre-training transformer models. In MLM, the model is trained to predict randomly masked words in a sentence based on their context, while in ALM, the model predicts the next word in a sequence given the preceding words.

- These pre-training tasks enable the model to learn rich representations of language, which can then be fine-tuned for specific applications. During fine-tuning, the pre-trained model is further trained on a smaller labeled dataset, allowing it to adapt its learned representations to the specific requirements of the target task. This approach has been shown to significantly improve performance across a wide range of natural language processing tasks.

- `Self-supervised learning` is traditionally also referred to as `unsupervised learning`, as it does not require labeled data for training. However, the term "self-supervised" emphasizes the model's ability to generate its own supervision signals from the input data itself, making it a more accurate description of the learning process involved.


- The main idea of pre-training and fine-tuning transformer models is to first train the model on a large corpus of unlabeled text data using self-supervised learning techniques, allowing it to learn general language representations. Then, the model is fine-tuned on smaller labeled datasets for specific downstream tasks, enabling it to adapt its learned knowledge to the requirements of those tasks. This two-step process has proven effective in improving the performance of transformer models across various natural language processing applications.

- A complete training procedure of a transformer-based model consists of two main phases:
  - pre-training on a large corpus of unlabeled text data using self-supervised learning techniques, such as masked language modeling (MLM) or autoregressive language modeling (ALM).
  - fine-tuning on smaller labeled datasets for specific downstream tasks, such as sentiment analysis, question answering, or machine translation.

- The fine-tuning approach updates the pre-trained model's parameters in a regular supervised learning manner via backpropagation, using labeled data specific to the target task. This allows the model to adapt its learned representations to the nuances of the task at hand, improving its performance on that specific application.

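- The figure below illustrates the two stages of training transformer models, and the difference between the feature-based approach (the pre-trained model stays frozen and its output embeddings serve as input features for a separate model) and the fine-tuning approach for adapting pre-trained models to downstream tasks.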

![transformer-pretrain-finetune](./figures/16_10.png)
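
To make the "self-supervised" idea concrete, here is a minimal sketch of how both pre-training objectives derive their targets from raw text alone. The helper names `next_word_examples` and `masked_examples` are made up for illustration, and whitespace splitting stands in for a real subword tokenizer.

```python
def next_word_examples(tokens):
    """Autoregressive objective: each prefix is an input, the following token is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_examples(tokens, mask_positions, mask_token="[MASK]"):
    """Masked objective: hide the chosen positions and keep the hidden tokens as targets."""
    masked = [mask_token if i in mask_positions else tok
              for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return masked, targets


# Toy input; no human-provided labels are involved anywhere.
tokens = "the quick brown fox jumps".split()

# Next-word prediction: (context, target) pairs.
for context, target in next_word_examples(tokens):
    print(context, "->", target)

# Masked language modeling: corrupted input plus the tokens to recover.
masked, targets = masked_examples(tokens, mask_positions={1, 3})
print(masked, targets)
```

In both cases the supervision signal is a piece of the input that was held back, which is why no labeled dataset is needed for pre-training.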

### **Leveraging unlabeled data with GPT**

- GPT (Generative Pre-trained Transformer) is a transformer-based language model that leverages large amounts of unlabeled text data for pre-training using an autoregressive language modeling objective. During pre-training, GPT learns to predict the next word in a sequence given the preceding words, allowing it to capture complex linguistic patterns and semantic relationships in the data.

**GPT-1 Architecture:**

Its training procedure consists of two main phases:
1. Pre-training on a large corpus of unlabeled text data using an autoregressive language modeling objective.
2. Supervised fine-tuning on smaller labeled datasets for specific downstream tasks.

![gpt-architecture](./figures/16_11.png)


- As a transformer, GPT-1 consists of
  - a decoder (without an encoder block), and
  - an additional layer that is added later for the supervised fine-tuning phase.


- During pre-training, GPT-1 utilizes a transformer decoder structure, where, at a given word position, the model only relies on preceding words to predict the next word. This is achieved through masked self-attention mechanisms that prevent the model from attending to future words in the sequence.

- GPT-1 utilizes a `unidirectional self-attention mechanism`, as opposed to the `bidirectional self-attention mechanism` used in models like BERT. This means that, during pre-training, GPT-1 only attends to preceding words in the sequence when predicting the next word, whereas BERT attends to both preceding and following words.

- Difference between zero-shot, one-shot, few-shot, and many-shot learning in the context of GPT models:
  - `Zero-shot learning`: The model is tested on a task without any prior examples or training specific to that task. It relies solely on its pre-trained knowledge to make predictions.
  
  - `One-shot learning`: The model is provided with a single example of the task during testing. It uses this example to adapt its predictions for the task at hand.
  
  - `Few-shot learning`: The model is given a small number of examples (typically ranging from 2 to 10) during testing. It leverages these examples to better understand the task and improve its predictions.
  
  - `Many-shot learning`: The model is trained on a larger number of examples specific to the task, allowing it to learn more effectively and make more accurate predictions.
  
![comparison-zero-one-few-many-shot](./figures/16_12.png)
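
The zero-, one-, and few-shot settings above differ only in how many worked examples are placed in the prompt; the model's weights are not updated. A minimal sketch of that idea follows. The `build_prompt` helper and the `=>` separator are assumptions chosen for illustration, not a format required by any GPT model.

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt with zero, one, or several in-context examples."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)


task = "Translate English to French:"
demos = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
    ("peppermint", "menthe poivrée"),
]

zero_shot = build_prompt(task, [], "plush giraffe")         # no examples
one_shot = build_prompt(task, demos[:1], "plush giraffe")   # one example
few_shot = build_prompt(task, demos, "plush giraffe")       # several examples

print(few_shot)
```

Any of these strings could then be passed to a text-generation model as its input prompt.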


---

### **Using GPT-2 to generate new text**

- We will be accessing GPT-2 via the `transformers` library by Hugging Face, which provides pre-trained models and tools for natural language processing tasks.

In [3]:
import transformers
print(transformers.__version__)

4.57.1


- Import a pre-trained GPT model that can generate text:

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')

- Then, we can prompt the model with a text snippet and ask it to generate new text based on that input snippet.

In [None]:
set_seed(123)
generator("Hey readers, today is",
          max_length=20,
          num_return_sequences=3)
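
Conceptually, the call above repeats next-token prediction until the total length (prompt included) reaches `max_length`. The sketch below imitates that loop with a hand-written lookup table in place of GPT-2. `TOY_NEXT_TOKENS` and `toy_generate` are hypothetical names, and the real model samples from a learned probability distribution over its whole vocabulary rather than from a small table.

```python
import random

# Hypothetical toy "model": maps the last token to its possible next tokens.
TOY_NEXT_TOKENS = {
    "today": ["is"],
    "is": ["a", "the"],
    "a": ["good", "new"],
    "the": ["day", "first"],
    "good": ["day"],
    "new": ["day"],
    "first": ["day"],
}


def toy_generate(prompt_tokens, max_length, rng):
    """Append one sampled token at a time until max_length tokens or no continuation."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = TOY_NEXT_TOKENS.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        tokens.append(rng.choice(candidates))
    return tokens


rng = random.Random(123)  # fixed seed for reproducibility, similar in spirit to set_seed(123)
for _ in range(3):        # three samples, similar in spirit to num_return_sequences=3
    print(" ".join(toy_generate(["today"], max_length=5, rng=rng)))
```

Because each step samples from the candidates, repeated calls can yield different continuations of the same prompt, which is why fixing the seed matters for reproducible output.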

---

### **Bidirectional pre-training with BERT**

- `BERT` (Bidirectional Encoder Representations from Transformers) is a transformer-based language model that utilizes bidirectional self-attention mechanisms for pre-training. Unlike GPT, which employs a unidirectional approach, BERT considers both preceding and following words in a sequence when predicting masked words during pre-training.

- `BERT` training is often described as `nondirectional` because the model reads all input elements at once, rather than one at a time in sequence. The encoding of a given word therefore depends on both the preceding and the succeeding words.


![Preparing the input for BERT](./figures/16_13.png)
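
One way to picture the difference between GPT's unidirectional attention and BERT's bidirectional attention is as a matrix of allowed connections, where row `i` lists the positions that token `i` may attend to. This is a simplified sketch in plain Python; `causal_mask` and `full_mask` are illustrative names, and real implementations apply such masks to the attention scores before the softmax.

```python
def causal_mask(n):
    """Position i may attend to positions 0..i only (GPT-style, unidirectional)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def full_mask(n):
    """Every position may attend to every position (BERT-style, bidirectional)."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "cat", "sat", "down"]
masks = [("causal", causal_mask(len(tokens))), ("full", full_mask(len(tokens)))]

for name, mask in masks:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>5} {row}")
```

In the causal case the upper triangle is zero, so no token can look at words that come after it; in the full case every entry is one.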

- During pre-training, BERT uses a `masked language modeling` (MLM) objective, where a certain percentage of words in the input sequence are randomly masked, and the model is trained to predict these masked words based on their context. This allows BERT to learn rich representations of language that capture both local and global context.


- BERT is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding. BERT is also very versatile because its learned language representations can be adapted for other NLP tasks by fine-tuning an additional layer or head.

- In the `masked language model (MLM)`, tokens are randomly replaced by so-called mask tokens, `[MASK]`, and the model is required to predict these hidden words. Compared with the next-word prediction in `GPT`, `MLM` in `BERT` is more akin to “filling in the blanks” because the model can attend to all tokens in the sentence (except the masked ones).

![MLM](./figures/16_14.png)
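
The masking step can be sketched as follows. The proportions used here come from the original BERT paper rather than from the notes above: about 15% of tokens are selected for prediction, and of those 80% become `[MASK]`, 10% are swapped for a random token, and 10% are left unchanged. The function name `bert_style_mask` and the toy vocabulary are assumptions for illustration.

```python
import random


def bert_style_mask(tokens, vocab, rng, select_prob=0.15, mask_token="[MASK]"):
    """Return (corrupted tokens, labels); a label is None where no prediction is required."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            corrupted.append(tok)   # not selected: left alone, not predicted
            labels.append(None)
            continue
        labels.append(tok)          # selected: the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the original token
    return corrupted, labels


rng = random.Random(0)
tokens = "the model learns to fill in the blanks".split()
vocab = sorted(set(tokens))  # toy vocabulary built from the sentence itself

# select_prob is raised to 0.5 only so that this short sentence shows some masking.
corrupted, labels = bert_style_mask(tokens, vocab, rng, select_prob=0.5)
print(corrupted)
print(labels)
```

The loss is computed only at positions whose label is not `None`, so most of the sentence serves purely as context.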

- Next-sentence prediction is a natural modification of the next-word prediction task considering the bidirectional encoding of BERT. In fact, many important NLP tasks, such as question answering, depend on the relationship of two sentences in the document. This kind of relationship is hard to capture via regular language models because next-word prediction training usually occurs on a single-sentence level due to input length constraints.
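
A next-sentence-prediction training example can be sketched as a sentence pair wrapped in BERT's special tokens, with segment ids marking which sentence each token belongs to and a binary label saying whether the second sentence really follows the first. The helper `make_nsp_example` is a hypothetical name. As a simplification, the negative sentence is drawn from the same short list, whereas the original BERT setup samples it from elsewhere in the corpus.

```python
import random


def make_nsp_example(sentences, index, rng):
    """Build one next-sentence-prediction example starting from sentences[index].

    Assumes at least three sentences and index < len(sentences) - 1.
    """
    first = sentences[index]
    if rng.random() < 0.5:
        # positive example: the true next sentence
        second, is_next = sentences[index + 1], True
    else:
        # negative example: any sentence other than the first one or its successor
        candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
        second, is_next = rng.choice(candidates), False

    tokens = ["[CLS]"] + first.split() + ["[SEP]"] + second.split() + ["[SEP]"]
    # segment 0 covers [CLS], the first sentence, and the first [SEP]; segment 1 the rest
    n_first = len(first.split()) + 2
    segment_ids = [0] * n_first + [1] * (len(tokens) - n_first)
    return tokens, segment_ids, is_next


sentences = [
    "the cat sat on the mat",
    "then it fell asleep",
    "stock prices rose sharply today",
]
tokens, segment_ids, is_next = make_nsp_example(sentences, 0, random.Random(0))
print(tokens)
print(segment_ids)
print(is_next)
```

During pre-training, the model would predict `is_next` from the output at the `[CLS]` position, alongside the masked-token predictions.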


![Using BERT to fine-tune different NLP tasks](./figures/16_15.png)


### **The best of both worlds: BART**

![BART architecture](./figures/16_16.png)