# Install Hugging Face Transformers

In [4]:
# pip install transformers


# Decoder-Only

The following example uses the Hugging Face Transformers library to illustrate a decoder-only Transformer with GPT-2. It loads the pre-trained model, feeds it a prompt, and generates a continuation of that prompt.

Steps:
1. Load the pre-trained GPT-2 model and tokenizer.
2. Generate text using a given prompt.

Step 1 uses `GPT2Tokenizer`, which converts the input text into token IDs that the model can process, and `GPT2LMHeadModel`, the pre-trained GPT-2 model with a language-modeling head. Step 2 calls the model's `generate()` method, which performs autoregressive text generation: it produces the output one token at a time, each conditioned on the tokens before it.
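To make the idea of autoregressive generation concrete, here is a minimal pure-Python sketch. The lookup table `NEXT_TOKEN_PROBS` and the `generate` helper are hypothetical stand-ins for the model, not part of the Transformers library, and unlike GPT-2 the toy "model" looks only at the previous token rather than the whole prefix.

```python
import random

# Hypothetical next-token distributions keyed by the previous token.
# A real model such as GPT-2 conditions on the whole prefix, not just one token.
NEXT_TOKEN_PROBS = {
    "<bos>": {"humans": 0.6, "robots": 0.4},
    "humans": {"and": 0.7, "<eos>": 0.3},
    "and": {"robots": 0.8, "humans": 0.2},
    "robots": {"<eos>": 0.9, "and": 0.1},
}

def generate(prompt, max_length, do_sample, rng=None):
    """Extend `prompt` one token at a time until <eos> or `max_length` tokens in total."""
    rng = rng or random.Random(0)
    tokens = list(prompt)
    while len(tokens) < max_length:
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        if do_sample:
            # Sampling: draw the next token in proportion to its probability.
            next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        else:
            # Greedy decoding: always take the most probable token.
            next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

greedy = generate(["<bos>"], max_length=6, do_sample=False)
sampled = generate(["<bos>"], max_length=6, do_sample=True, rng=random.Random(42))
print("greedy: ", greedy)   # ['<bos>', 'humans', 'and', 'robots']
print("sampled:", sampled)  # depends on the random seed
```

As in the GPT-2 cell, `max_length` here counts the prompt tokens as well as the generated ones, and `do_sample` switches between deterministic and randomized output.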

In [6]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load the GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Set the model in evaluation mode
model.eval()

# Input prompt
prompt = "In a future world, humans and robots"

# Tokenize the input prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate text using GPT-2
# Parameters:
# - max_length: Maximum total length of the output in tokens (prompt plus generated tokens)
# - num_return_sequences: Number of sequences to return (1 by default)
# - do_sample: Whether to sample from the predicted distribution (True: output varies between runs; False: deterministic greedy decoding)
# - pad_token_id: GPT-2 has no padding token, so the end-of-sequence token is reused to suppress a warning; see https://stackoverflow.com/questions/69609401/suppress-huggingface-logging-warning-setting-pad-token-id-to-eos-token-id
output = model.generate(input_ids, max_length=50, num_return_sequences=1, do_sample=True, pad_token_id=tokenizer.eos_token_id)

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the generated text
print("Generated Text:\n", generated_text)

Generated Text:
 In a future world, humans and robots would need to compete to provide jobs to everyone, so they could help to spread a global economic health crisis.

But many of the concerns raised by robots have not been fully addressed, and could easily be


# Encoder-Only

This example uses the Hugging Face Transformers library to illustrate an encoder-only Transformer: BERT applied to a text classification task (e.g., sentiment analysis).

Steps:
- Load a pre-trained BERT model for sequence classification.
- Tokenize the input text.
- Pass the tokenized input to the BERT model for predictions.

Some notes:
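- The `bert-base-uncased` checkpoint used here is pre-trained on a large corpus, but it has not been fine-tuned for classification. `BertForSequenceClassification` adds a newly initialized classification head on top of it (see the warning in the output), so the predicted label and probabilities are not meaningful until you fine-tune the model on a labeled dataset for your task.
- The `bert-base-uncased` model is case-insensitive: its tokenizer converts all text to lowercase before processing.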
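The tokenization step can be sketched without downloading any model. The snippet below lowercases the text and then splits each word with a greedy longest-match-first rule, the scheme WordPiece tokenizers use at inference time. The tiny `VOCAB` and both helper functions are hypothetical and chosen for illustration; the vocabulary is deliberately contrived so that "doing" is split into two pieces, and the real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`.

```python
# Hypothetical miniature vocabulary; the real bert-base-uncased vocabulary has about 30,000 entries.
VOCAB = {"hugging", "face", "is", "do", "##ing", "amazing", "work", "!", "[UNK]"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize_uncased(text, vocab):
    # Lowercase, separate simple punctuation, then apply WordPiece to each word.
    text = text.lower()
    for mark in "!?.,":
        text = text.replace(mark, f" {mark} ")
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens

tokens = tokenize_uncased("Hugging Face is doing amazing work!", VOCAB)
print(tokens)
# ['hugging', 'face', 'is', 'do', '##ing', 'amazing', 'work', '!']
```

Because of the lowercasing step, "Hugging" and "hugging" map to the same token, which is what "uncased" means in the model name.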

In [7]:
# Import necessary libraries
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained BERT tokenizer and model for sequence classification (binary task)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Example input text
text = "Hugging Face is doing amazing work!"

# Tokenize the input text for BERT
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

# Run the input through the model (without gradient calculation)
with torch.no_grad():
    outputs = model(**inputs)

# Extract the logits (raw predictions)
logits = outputs.logits

# Convert logits to probabilities using softmax
probs = torch.softmax(logits, dim=-1)

# Get the predicted label (0 or 1) by taking the argmax over the class dimension
predicted_label = torch.argmax(probs, dim=-1).item()

# Print the result
print(f"Predicted Label: {predicted_label}, Probabilities: {probs}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Predicted Label: 1, Probabilities: tensor([[0.4587, 0.5413]])
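The softmax and argmax steps in the cell above can be reproduced in a few lines of plain Python. This is a sketch with made-up logits, not the values produced by the model; the `softmax` helper is written here for illustration and mirrors what `torch.softmax` computes for a single row.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.0]  # made-up raw scores for labels 0 and 1
probs = softmax(logits)

# The predicted label is the index of the largest probability.
predicted_label = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs], predicted_label)
# [0.1192, 0.8808] 1
```

When the two logits are nearly equal, as with an untrained classification head, the probabilities land close to 0.5 each, which is the pattern visible in the output above.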


# Encoder-Decoder

This short example illustrates an encoder-decoder Transformer using T5 (Text-to-Text Transfer Transformer), a popular encoder-decoder architecture that treats tasks such as summarization and translation as text generation. Here it is used for text summarization.

Some notes:
- `T5Tokenizer` converts the input text into token IDs, which the model can process. It relies on the `sentencepiece` package.
- `T5ForConditionalGeneration` is the model class for sequence-to-sequence tasks (e.g., summarization, translation) in T5.
- Summarization: We add the task prefix `summarize:` to instruct the model to summarize the input text. T5 supports multiple tasks, and the task prefix conditions the model for the desired task.
- Beam search: The `num_beams=4` argument in `generate()` makes the decoder keep the four highest-scoring partial sequences at each step instead of only the single most likely one, which often improves output quality at the cost of extra computation.
- In this example, the encoder processes the input text and the decoder generates the output (the summary).
- You can switch to other tasks that T5 was trained on, such as translation, by changing the prefix (e.g., `translate English to French:`).
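The effect of `num_beams` can be seen on a toy example. The sketch below is a simplified beam search over a hypothetical next-token table (`NEXT`), written for illustration only: it omits the length penalty and stopping heuristics that `generate()` applies, and the toy "decoder" looks only at the previous token.

```python
import math

# Hypothetical next-token probabilities keyed by the previous token.
# A real decoder conditions on the encoder output and the whole prefix.
NEXT = {
    "<bos>": {"the": 0.5, "a": 0.4, "<eos>": 0.1},
    "the": {"model": 0.4, "paper": 0.35, "<eos>": 0.25},
    "a": {"transformer": 0.9, "<eos>": 0.1},
    "model": {"<eos>": 1.0},
    "paper": {"<eos>": 1.0},
    "transformer": {"<eos>": 1.0},
}

def beam_search(num_beams, max_length):
    """Return the best sequence found while keeping `num_beams` hypotheses per step."""
    beams = [(0.0, ["<bos>"])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_length):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, p in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Keep only the `num_beams` highest-scoring candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]

print(beam_search(num_beams=1, max_length=5))
# ['<bos>', 'the', 'model', '<eos>']        (probability 0.20)
print(beam_search(num_beams=2, max_length=5))
# ['<bos>', 'a', 'transformer', '<eos>']    (probability 0.36)
```

With a single beam the search commits to "the" because it is the most likely first token, and misses the more probable overall sequence. Keeping two hypotheses lets the search recover it, which is the motivation for using `num_beams` greater than 1.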

In [2]:
#pip install sentencepiece


In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"  # You can use 't5-base', 't5-large', etc.
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Define the input text to be summarized
input_text = """
The Transformer model, introduced in 2017, revolutionized natural language processing by replacing recurrent and convolutional networks. 
It relies on self-attention mechanisms to capture long-range dependencies in the data. The architecture consists of an encoder and decoder stack, 
making it suitable for a wide range of sequence-to-sequence tasks such as translation, summarization, and more.
"""

# Add the 'summarize:' prefix for T5 (it requires a task prefix for conditioning)
input_text_with_prefix = "summarize: " + input_text

# Tokenize input text
input_ids = tokenizer.encode(input_text_with_prefix, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary (the max_length of the summary can be adjusted)
summary_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Output the summary
print("Original Text: ", input_text)
print("\nSummary: ", summary)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Original Text:  
The Transformer model, introduced in 2017, revolutionized natural language processing by replacing recurrent and convolutional networks. 
It relies on self-attention mechanisms to capture long-range dependencies in the data. The architecture consists of an encoder and decoder stack, 
making it suitable for a wide range of sequence-to-sequence tasks such as translation, summarization, and more.


Summary:  the Transformer model, introduced in 2017, revolutionized natural language processing. it relies on self-attention mechanisms to capture long-range dependencies.

