# Transformers
- Transformers are perhaps one of the most special, impactful and important creations in AI/Computer Science of **all time**.
- Transformer models are used to solve all kinds of tasks across different modalities, including natural language processing (NLP), computer vision, audio processing, and more.

### Hugging Face Transformers Library

The **Hugging Face Transformers** library is the most important and widely-used library for working with transformer models in the AI/ML ecosystem.

### What is it?
- A comprehensive Python library that provides easy access to thousands of pre-trained transformer models
- Supports all major transformer architectures: BERT, GPT, T5, RoBERTa, DistilBERT, and many more
- Provides a unified, consistent API regardless of the underlying model architecture
- Includes tools for fine-tuning, training from scratch, and deploying models to production

### Why it's significant
The library democratized AI by making state-of-the-art transformer models accessible to developers, researchers, and companies worldwide. It created a standardized interface that works across different model families and tasks, building the largest model hub with 100,000+ pre-trained models shared by the community. This massive ecosystem is now used by thousands of companies in real-world applications.

### The Pipeline Function - The Game Changer

- The `pipeline()` function is the simplest entry point in the transformers library, yet perhaps its most revolutionary feature: it's like having a "one-click" solution for AI tasks. It handles tokenization, model loading, and post-processing automatically, and supports 20+ different tasks (classification, generation, translation, summarization, etc.) with little or no configuration.
- You can think of it as something that connects a model with its necessary preprocessing and postprocessing steps. This lets us directly input any text and get an intelligible answer.
- Below I will show you an example of sentiment analysis using the pipeline function, on our own inputs.

In [1]:
%pip install -r transformers_requirements.txt


In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis") # there are SO many different tasks you could name here
# by default, a particular pretrained model that has been fine-tuned for sentiment analysis in English gets chosen
# here, sentiment-analysis defaults to distilbert-base-uncased-finetuned-sst-2-english,
# a DistilBERT checkpoint (DistilBERT is a smaller, distilled version of Google's BERT)
classifier("I'm so hungry man!")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'NEGATIVE', 'score': 0.9931273460388184}]

In [5]:
# Can even pass multiple inputs at once
classifier(
    ["I'm so hungry man.", "I love food so much."]
)

[{'label': 'NEGATIVE', 'score': 0.9961236119270325},
 {'label': 'POSITIVE', 'score': 0.9998375177383423}]

It's worth noting that the model gets downloaded and cached when you create the classifier object. Every time you rerun the command after that, the cached model gets used instead, so there is no need for repeated downloads.
Here's what happens every time you pass some text into a pipeline:
- The text is preprocessed into a format the model can understand.
- The preprocessed inputs are passed to the model.
- The predictions of the model are post-processed, so you can make sense of them.
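
To make those three stages concrete, here is a minimal pure-Python sketch of the same flow. Everything in it (the `VOCAB`, the `WEIGHTS`, and the function names) is a made-up toy for illustration, not what the real `pipeline()` does internally; the real tokenizer and model are far more complex.

```python
import math

# Hypothetical toy vocabulary and weights, invented for illustration only.
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "food": 4}
# One (negative, positive) weight pair per vocabulary id.
WEIGHTS = [(0.0, 0.0), (0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.5)]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    words = text.lower().replace(".", " ").replace("!", " ").split()
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in words]


def model_forward(token_ids):
    """Step 2: produce one raw score (logit) per label."""
    negative = sum(WEIGHTS[i][0] for i in token_ids)
    positive = sum(WEIGHTS[i][1] for i in token_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: softmax the logits and report the most likely label."""
    peak = max(logits)  # subtracted for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, mirroring what a pipeline call does."""
    return postprocess(model_forward(preprocess(text)))


print(toy_pipeline("I love food."))
print(toy_pipeline("I hate food."))
```

The output dictionaries have the same `label`/`score` shape as the real classifier output above, which is the part worth remembering: the score is just a softmax probability computed in the post-processing step.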

It's also worth noting the different tasks you can do with the pipeline object. Below are some examples:

Text pipelines
- text-generation: Generate text from a prompt
- text-classification: Classify text into predefined categories
- summarization: Create a shorter version of a text while preserving key information
- translation: Translate text from one language to another
- zero-shot-classification: Classify text without prior training on specific labels
- feature-extraction: Extract vector representations of text

Image pipelines
- image-to-text: Generate text descriptions of images
- image-classification: Identify objects in an image
- object-detection: Locate and identify objects in images

Audio pipelines
- automatic-speech-recognition: Convert speech to text
- audio-classification: Classify audio into categories
- text-to-speech: Convert text to spoken audio

Multimodal pipelines
- image-text-to-text: Respond to an image based on a text prompt

Now, let's try classifying texts that haven't been labelled. We're going to use something called zero-shot classification, which lets us specify which labels to use for classification, so we don't have to rely on a pretrained model's labels.
- (The pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want.)

In the above example, we classified using 2 labels: positive and negative. But now I'll show you how to classify text using ANY set of labels of your choice.

In [6]:
classifier = pipeline("zero-shot-classification")
classifier(
    "History isn't really a fun class",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'sequence': "History isn't really a fun class",
 'labels': ['education', 'business', 'politics'],
 'scores': [0.7267491817474365, 0.19325977563858032, 0.07999099045991898]}
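
How can one model score arbitrary labels? The default checkpoint here, facebook/bart-large-mnli, is a natural language inference (NLI) model. As far as I understand the pipeline, each candidate label is turned into a hypothesis sentence (with a template along the lines of "This example is {}."), the model scores how strongly the input entails each hypothesis, and those scores are normalized across labels. The sketch below mimics that flow; `fake_entailment_logit` and `RELATED_WORDS` are hypothetical stand-ins for the real NLI model, so the numbers will not match the output above.

```python
import math

# Hypothetical keyword table standing in for what a real NLI model has learned.
RELATED_WORDS = {
    "education": {"class", "history", "school"},
    "politics": {"election", "vote"},
    "business": {"market", "profit"},
}


def fake_entailment_logit(premise, hypothesis):
    """Toy stand-in for an NLI model: higher means 'premise entails hypothesis'."""
    premise_words = set(premise.lower().replace("'", " ").split())
    score = 0.0
    for label, related in RELATED_WORDS.items():
        if label in hypothesis:
            score += len(premise_words & related)
    return score


def toy_zero_shot(sequence, candidate_labels, template="This example is {}."):
    """Score one hypothesis per label, then softmax across the labels."""
    logits = [fake_entailment_logit(sequence, template.format(label))
              for label in candidate_labels]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(probs, candidate_labels), reverse=True)
    return {
        "sequence": sequence,
        "labels": [label for _, label in ranked],
        "scores": [prob for prob, _ in ranked],
    }


result = toy_zero_shot(
    "History isn't really a fun class",
    candidate_labels=["education", "politics", "business"],
)
print(result)
```

Because the labels only appear inside the hypothesis text, you can swap in any label list without retraining anything, which is the whole point of "zero-shot".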

You can also choose a specific model for a specific task!

In [10]:
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M") # model names like this one
# come directly from the Hugging Face Hub. You can plug in any compatible model from the Hub.

# This is text generation: you provide a prompt, and the model auto-completes it by generating the remaining text.
# Similar to the text predictions you see on your phone.
# Text generation involves sampling, so you are unlikely to get the exact same response every time.

# Note: as the warning below shows, `max_length` can be overridden by a default `max_new_tokens`;
# passing `max_new_tokens` instead is the more direct way to cap the output length.
generator(
    "So today, I just feel like",
    max_length=30,
    num_return_sequences=2,
)

Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "So today, I just feel like I need to make sure that I am getting all of my 2016 goals accomplished. \xa0I have my goals for 2016, but I didn't get everything accomplished. \xa0So, I want to make sure that I can see what I have accomplished.\n\nI need to get more into the habit of writing in my journal, I think that will help me get through a lot of things. \xa0I know that when I feel like I am not feeling good, I can just go into my journal and take a look at what I should be doing. \xa0It will help me see where I need to improve.\n\nI am going to post the 2016 goals that I have listed in this post. \xa0I have listed the goals for each day, and will have a few goals for each week and month.\n\nI want to be able to see these goals, so I can see where I am and where I need to improve. \xa0I know that if I don't see these, I will feel discouraged and will not be able to get through a lot of things.\n\nI want to be able to see how far I have come, so even if I didn't a
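
Why does the output change from run to run? When sampling is enabled, the next token is drawn at random from the model's probability distribution instead of always taking the top choice. Here is a small sketch of that idea in plain Python. The `next_token_logits` values are hypothetical numbers I made up, not real model scores, and `sample_next_token` is an illustrative helper, not a transformers function.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(list(logits), weights=weights, k=1)[0]


# Hypothetical next-token scores after the prompt "So today, I just feel like"
next_token_logits = {"I": 2.0, "going": 1.5, "a": 1.0, "pizza": 0.2}

# Different seeds can give different continuations; the same seed repeats itself.
for seed in (0, 1, 2):
    rng = random.Random(seed)
    print(seed, [sample_next_token(next_token_logits, rng=rng) for _ in range(5)])

# Greedy decoding always picks the highest-scoring token, so it is deterministic.
greedy_choice = max(next_token_logits, key=next_token_logits.get)
print("greedy:", greedy_choice)
```

Fixing the random seed is the usual way to make sampled output reproducible; lowering the temperature makes the distribution sharper and the output less varied.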

There are many other tasks and examples I could show, so I won't go over them exhaustively here.

## Combining data from multiple sources
One powerful application of Transformer models is their ability to combine and process data from multiple sources. This is especially useful when you need to:

- Search across multiple databases or repositories
- Consolidate information from different formats (text, images, audio)
- Create a unified view of related information

For example, you could build a system that:
- Searches for information across databases in multiple modalities like text and image.
- Combines results from different sources into a single coherent response. For example, from an audio file and text description.
- Presents the most relevant information from a database of documents and metadata.

This is an important aspect of transformers, and is worth noting.
Next, we're going to get into the really good stuff, the nitty-gritty: how transformers ACTUALLY work, what goes on inside, and the transformer model architecture in general.

# Transformer Architecture
- Transformer architecture is mainly composed of 2 blocks/layers/models:
  - Encoder layer (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  - Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs. 
- Here's an image for visualisation

![Transformer Architecture Visualisation](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks-dark.svg)
- Each of these parts can be used independently too, depending on the task:
  - Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  - Decoder-only models: Good for generative tasks such as text generation.
  - Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
- Another key feature of Transformer models is their **Attention layers**.
  - By the way, the transformer architecture was introduced in a Google paper titled "Attention is All You Need". Says a lot.
-  This layer tells the model to pay attention **specifically** to certain parts of the data (e.g. with LLMs, that would be words in the sentence) that you passed it (and more or less ignore the others), when dealing with the representation of each piece of data (word). 
    - Here's an example: imagine you have to translate "You like this house" from English to French. The translation model will need to look at "you" to get the proper translation for "like", because in French, the word "like" is written differently depending on the subject. But, the rest of the sentence is not useful for the translation of "like".
    - In the same way, when translating “this” the model will also need to pay attention to the word "house", because “this” translates differently, depending on if the noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of “this”.
    - With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.
    - This same concept applies to ANY task at ANY scale associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.
    - Here's a code visualization and application of our example:

In [15]:
from transformers import MarianMTModel, MarianTokenizer
import torch
import matplotlib.pyplot as plt
import seaborn as sns

# Load model and tokenizer.
# Request the "eager" attention implementation at load time so attention weights can be returned
# (assumes a transformers version that supports the attn_implementation argument).
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name, attn_implementation="eager")
model.eval()

# Prepare input
src_text = "You like this house"
inputs = tokenizer(src_text, return_tensors="pt")

with torch.no_grad():
    # Run only the encoder: calling the full model without decoder inputs raises a ValueError
    encoder_output = model.get_encoder()(**inputs, output_attentions=True)
    # Get translation
    translated = model.generate(**inputs)

fr_text = tokenizer.decode(translated[0], skip_special_tokens=True)
print("French translation:", fr_text)

# Get attention from last encoder layer
attentions = encoder_output.attentions[-1][0]  # shape: [num_heads, seq_len, seq_len]

# Average over heads for visualization
avg_attn = attentions.mean(dim=0)

# Clean token labels (drop the trailing end-of-sequence token and any padding)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tokens = [tok for tok in tokens if tok not in ("</s>", "<pad>")]
avg_attn = avg_attn[:len(tokens), :len(tokens)]

# Plot
plt.figure(figsize=(8, 6))
sns.heatmap(avg_attn.numpy(), xticklabels=tokens, yticklabels=tokens, cmap="viridis", annot=True, fmt=".2f")
plt.title("Average Encoder Self-Attention (Last Layer)")
plt.xlabel("Key")
plt.ylabel("Query")
plt.tight_layout()
plt.show()

- The code above plots a self-attention matrix from the translation model's encoder (last layer, averaged over all heads). Here’s how to read it:
	-	Y-axis (Query): The token whose attention is being calculated (e.g. “like”, “house”).
	-	X-axis (Key): The tokens that receive attention from the query.
	-	Color intensity: Strength of attention weight (brighter = more attention).

In short, a row shows which tokens a particular word is focusing on.
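
Where do those rows come from? Each row is a softmax over the scaled dot products between one query vector and every key vector. Below is a minimal pure-Python sketch of scaled dot-product attention. The 2-dimensional query and key vectors are hypothetical values I picked by hand so that "like" lines up with "You" and "this" lines up with "house"; in a real model they are learned projections of the token embeddings.

```python
import math


def softmax(row):
    """Turn raw scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return weights, outputs


tokens = ["You", "like", "this", "house"]
# Hypothetical hand-picked vectors (not learned): one query and one key per token.
queries = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
keys = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 4.0]]
values = keys  # reuse the keys as values to keep the toy small

weights, outputs = attention(queries, keys, values)
for token, row in zip(tokens, weights):
    print(f"{token:>6}", [round(w, 2) for w in row])
```

Each printed row is one row of the heatmap: the "like" row puts most of its weight on "You", and the "this" row puts most of its weight on "house", while the rows with all-zero queries spread their attention evenly.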
- To further accelerate our understanding, we will briefly look at the original transformer architecture:
  - Original transformer architecture was designed for translation.
  - During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. 
  - In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). 
  - The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). 
    - For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.
  - To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!).
  - For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.
  - Here's a visualisation of the original Transformer architecture, with the encoder on the left and the decoder on the right:
![Original Transformer Architecture Visualisation](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers-dark.svg)
  - Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word.
    - This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.
  - The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.
  - Hopefully, this gave you a solid mental model of how the original Transformer architecture works — particularly the encoder-decoder setup and how attention mechanisms shape language understanding and generation. We covered this not just for historical context, but because these components (like self-attention, masking, and encoder-decoder flows) form the foundation of nearly all modern Transformer-based models. 
    - Understanding them now will make it much easier to grasp variants like GPT, BERT, or T5, and to debug, fine-tune, or even build your own architectures later on.
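
The two kinds of masking described above (no peeking at future words in the decoder, and ignoring padding) can be sketched in a few lines of plain Python. This is a toy under my own assumptions: the helper names are made up, the raw scores are all zero so the effect of the masks is easy to see, and the `<pad>` token string is just an example.

```python
import math

NEG_INF = float("-inf")


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def padding_mask(tokens, pad_token="<pad>"):
    """True for real tokens, False for padding."""
    return [tok != pad_token for tok in tokens]


def masked_softmax(scores, allowed):
    """Softmax where disallowed positions get a score of -inf, i.e. weight 0."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    peak = max(masked)
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["Tu", "aimes", "cette", "<pad>"]
causal = causal_mask(len(tokens))
not_pad = padding_mask(tokens)
raw_scores = [[0.0] * len(tokens) for _ in tokens]  # uniform scores before masking

weights = []
for i, row in enumerate(raw_scores):
    # A position may attend to j only if j is not in the future AND j is not padding.
    allowed = [causal[i][j] and not_pad[j] for j in range(len(tokens))]
    weights.append(masked_softmax(row, allowed))

for token, row in zip(tokens, weights):
    print(f"{token:>6}", [round(w, 2) for w in row])
```

The printed matrix is lower-triangular with a zero column for `<pad>`: each position only sees itself and earlier positions, and nothing ever attends to the padding. This is also why the whole target sentence can be fed to the decoder at once during training.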
  #### Architectures and checkpoints
  - This is terminology that is important to know for now and later. You’ll see mentions of architectures and checkpoints as well as models. These terms all have slightly different meanings:
    - **Architecture**: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.
    - **Checkpoints**: These are the weights that will be loaded in a given architecture.
    - **Model**: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean both. This course will specify architecture or checkpoint when it matters to reduce ambiguity.
    - Example: BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”
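
A tiny analogy in code may help here. In this sketch the class plays the role of the architecture (it defines the computation), and each dictionary of numbers plays the role of a checkpoint (the weights loaded into it). `TinyClassifier`, `from_checkpoint`, and both checkpoints are hypothetical names and values invented for this example.

```python
class TinyClassifier:
    """Architecture: defines the computation, independent of any particular weights."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    @classmethod
    def from_checkpoint(cls, checkpoint):
        """Load a set of weights (a checkpoint) into this architecture."""
        return cls(checkpoint["weights"], checkpoint["bias"])

    def __call__(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


# Two hypothetical checkpoints for the same architecture.
checkpoint_a = {"weights": [1.0, -1.0], "bias": 0.0}
checkpoint_b = {"weights": [0.5, 2.0], "bias": 1.0}

model_a = TinyClassifier.from_checkpoint(checkpoint_a)
model_b = TinyClassifier.from_checkpoint(checkpoint_b)

features = [2.0, 1.0]
print(model_a(features))  # 1.0
print(model_b(features))  # 4.0
```

Same architecture, different checkpoints, different behavior. That is the relationship between BERT and bert-base-cased.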


### How transformers solve their tasks
- Now, we will delve into specific transformer architectural variants, and how they solve their respective tasks. Before we do this, however, it's important to understand that most tasks follow a similar pattern: input data is processed through a model, and the output is interpreted for a specific task. The differences lie in how the data is prepared, what model architecture variant is used, and how the output is processed.
- To explain how tasks are solved, we’ll walk through what goes on inside the model to output useful predictions. We’ll cover the following models and their corresponding tasks:

  - Wav2Vec2 for audio classification and automatic speech recognition (ASR)
  - Vision Transformer (ViT) and ConvNeXT for image classification
  - DETR for object detection
  - Mask2Former for image segmentation
  - GLPN for depth estimation
  - BERT for NLP tasks like text classification, token classification and question answering that use an encoder
  - GPT2 for NLP tasks like text generation that use a decoder
  - BART for NLP tasks like summarization and translation that use an encoder-decoder



## Language models and the most famous examples:
- All the popular Transformer models (GPT, BERT, T5, etc.) have been trained as language models. 
- Language models are just models that have been trained on large amounts of raw text in a self-supervised fashion.
- (Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!)
- This type of model develops a statistical understanding of the language it has been trained on, but it’s less useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning or fine-tuning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.
- An example of a task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.
- Example below:

![Causal language modeling visualization](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/causal_modeling-dark.svg)

- Additionally, the general strategy to achieve better performance is by increasing the models’ sizes as well as the amount of data they are pretrained on.
- Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources.
- This means sharing models/resources is very efficient and optimal, because it saves time, potentially money, and just overall resources for everyone.
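
To see why no human labelling is needed for causal language modeling, here is a small sketch of how training examples can be built straight from raw text: every prefix becomes an input, and the word that follows it becomes the target. The helper name and the example sentence are my own, and real models work on subword tokens rather than whitespace-split words.

```python
def causal_lm_examples(tokens):
    """Build (context, next_token) pairs from raw tokens; the text labels itself."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()
examples = causal_lm_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

Each target only depends on the words before it, never the ones after, which is exactly the "causal" part of the name.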

## Transfer Learning
- Initializing a model with another model's weights.
- Essentially, you leverage the knowledge acquired by a model trained on LOTS of data on another task.
### Pre-training
- **Pretraining**: training a model from scratch. The weights are initially randomly initialized, and the training starts with 0 knowledge.
  - This pretraining is usually done on very large amounts of data. Therefore, it requires a very large corpus of data, and training can take up to several weeks.
- **Fine-tuning**: training done AFTER a model has been pretrained. To perform fine-tuning, you first need a pretrained language model, then need to perform additional training with a dataset highly specific to your task.
  - Now, this might seem confusing. Why not just train your model for your final use case from the start? Here are some reasons why you do things this way:
  - The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
  - Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.
  - For the same reason, the amount of time and resources needed to get good results are much lower.
  - Fine-tuning example:
    - You can take a pretrained model (that was thoroughly trained on the English language), and fine-tune it on the arXiv corpus, which then results in a science/research-based model.
      - Corpus: a large and structured collection of text or speech data used for linguistic analysis and training ML models. In NLP/LLM context, it acts as the training data for our models that need to generate human language. 
      - arXiv corpus: arXiv is an archive of scholarly articles, primarily in scientific fields. So the arXiv corpus is a collection of scholarly articles available on the arXiv repository. 
    - The fine-tuning will only require a limited amount of data. The knowledge the pretrained model has acquired is “transferred,” (You’re transferring the capabilities of the pretrained model to a new domain/task using the arXiv data), hence the term transfer learning.
- It's important to note that transfer learning **ENCOMPASSES** fine-tuning.
- Fine-tuning often retrains only the last few layers of a model, but it can also be applied to the whole model.
- Fine-tuning a model has lower time, data, financial, and environmental costs than pretraining. It is also quicker and easier to iterate over different fine-tuning schemes (full fine-tuning, last-layer tuning, freezing the first n layers and training the rest, PEFT, etc.), as the training is less constraining (cheaper, faster, easier to experiment with) than a full pretraining.
- This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model — one as close as possible to the task you have at hand — and fine-tune it.
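
Here is a rough sketch of the "freeze the early layers, train the last one" scheme mentioned above, in plain Python. It is a deliberately tiny toy under my own assumptions: the "pretrained" layer is just two fixed multipliers, the head is a two-weight linear layer trained with gradient descent, and all names and numbers are hypothetical.

```python
def features(x, frozen_weights):
    """'Pretrained' layer: kept frozen during fine-tuning."""
    return [w * x for w in frozen_weights]


def predict(x, frozen_weights, head):
    """Full model: frozen feature layer followed by a trainable linear head."""
    return sum(h * f for h, f in zip(head, features(x, frozen_weights)))


def mse(data, frozen_weights, head):
    """Mean squared error over the dataset."""
    return sum((predict(x, frozen_weights, head) - y) ** 2 for x, y in data) / len(data)


def fine_tune_head(data, frozen_weights, head, lr=0.01, steps=200):
    """Gradient descent on the head only; frozen_weights are never updated."""
    head = list(head)
    for _ in range(steps):
        grads = [0.0] * len(head)
        for x, y in data:
            error = predict(x, frozen_weights, head) - y
            for k, f in enumerate(features(x, frozen_weights)):
                grads[k] += 2 * error * f / len(data)
        head = [h - lr * g for h, g in zip(head, grads)]
    return head


frozen_weights = [1.0, 0.5]  # hypothetical "pretrained" weights
initial_head = [0.0, 0.0]
data = [(x, 3.0 * x) for x in (-2.0, -1.0, 1.0, 2.0)]  # small task-specific dataset

loss_before = mse(data, frozen_weights, initial_head)
tuned_head = fine_tune_head(data, frozen_weights, initial_head)
loss_after = mse(data, frozen_weights, tuned_head)

print("loss before:", loss_before)
print("loss after:", loss_after)
print("frozen weights unchanged:", frozen_weights == [1.0, 0.5])
```

Only two numbers get trained here, which is why fine-tuning like this needs so little data and compute compared to pretraining. In practice you would do the same thing with a real framework by marking the pretrained parameters as non-trainable.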