# Transformers
- Transformers are arguably among the most impactful and important creations in AI and computer science of **all time**.
- Transformer models are used to solve all kinds of tasks across different modalities, including natural language processing (NLP), computer vision, audio processing, and more.

### Hugging Face Transformers Library

The **Hugging Face Transformers** library is one of the most important and widely used libraries for working with transformer models in the AI/ML ecosystem.

### What is it?
- A comprehensive Python library that provides easy access to thousands of pre-trained transformer models
- Supports all major transformer architectures: BERT, GPT, T5, RoBERTa, DistilBERT, and many more
- Provides a unified, consistent API regardless of the underlying model architecture
- Includes tools for fine-tuning, training from scratch, and deploying models to production

### Why it's significant
The library democratized AI by making state-of-the-art transformer models accessible to developers, researchers, and companies worldwide. It created a standardized interface that works across different model families and tasks, and it is backed by one of the largest model hubs, with 100,000+ pre-trained models shared by the community. This ecosystem is now used by thousands of companies in real-world applications.

### The Pipeline Function - The Game Changer

- The `pipeline()` function is the simplest object in the transformers library, yet perhaps its most revolutionary feature: it's like having a "one-click" solution for AI tasks. It handles tokenization, model loading, and post-processing automatically, and supports 20+ different tasks (classification, generation, translation, summarization, etc.) with little to no configuration.
- You can think of it as something that connects a model with its necessary preprocessing and postprocessing steps. This lets us directly input any text and get an intelligible answer.
- Below I will show you an example of sentiment analysis using the pipeline function, on our own inputs.

In [6]:
%pip install -r transformers_requirements.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting einops (from -r transformers_requirements.txt (line 1))
  Using cached einops-0.8.1-py3-none-any.whl.metadata (13 kB)
Using cached einops-0.8.1-py3-none-any.whl (64 kB)
Installing collected packages: einops
Successfully installed einops-0.8.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # many other task names can be passed here
# By default, a pretrained model that has been fine-tuned for English sentiment analysis gets chosen.
# Here, sentiment-analysis defaults to distilbert-base-uncased-finetuned-sst-2-english,
# a DistilBERT model (a smaller, distilled version of Google's BERT).
classifier("I'm so hungry man!")

  from .autonotebook import tqdm as notebook_tqdm
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'NEGATIVE', 'score': 0.9931273460388184}]

In [5]:
# You can also pass multiple inputs at once
classifier(
    ["I'm so hungry man.", "I love food so much."]
)

[{'label': 'NEGATIVE', 'score': 0.9961236119270325},
 {'label': 'POSITIVE', 'score': 0.9998375177383423}]

It's worth noting that the model gets downloaded and cached when you create the classifier object. Every time you rerun the command, the cached model gets used instead, so there is no need for repeated downloads.
Here's what happens every time you pass some text into a pipeline:
- The text is preprocessed into a format the model can understand.
- The preprocessed inputs are passed to the model.
- The predictions of the model are post-processed, so you can make sense of them.
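To make the third step more concrete, here is a minimal sketch of what post-processing could look like for a classification task: turn the model's raw scores (logits) into probabilities with a softmax, then report the most likely label. The logits and the `id2label` mapping below are made up for illustration, and `softmax` and `postprocess` are my own helper names, not functions from the transformers library.

```python
import math

def softmax(logits):
    """Turn raw model scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the max keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most likely label and report its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical raw scores for one input; real logits would come from the model.
example_logits = [2.5, -2.5]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

result = postprocess(example_logits, id2label)
print(result)
```

The result has the same shape as the dictionaries the sentiment classifier returned earlier: a `label` plus a `score` between 0 and 1.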

It's also worth noting the different tasks you can do with the pipeline object. Below are some examples:

Text pipelines
- text-generation: Generate text from a prompt
- text-classification: Classify text into predefined categories
- summarization: Create a shorter version of a text while preserving key information
- translation: Translate text from one language to another
- zero-shot-classification: Classify text without prior training on specific labels
- feature-extraction: Extract vector representations of text

Image pipelines
- image-to-text: Generate text descriptions of images
- image-classification: Identify objects in an image
- object-detection: Locate and identify objects in images

Audio pipelines
- automatic-speech-recognition: Convert speech to text
- audio-classification: Classify audio into categories
- text-to-speech: Convert text to spoken audio

Multimodal pipelines
- image-text-to-text: Respond to an image based on a text prompt

Now, let's try classifying texts that haven't been labelled. We're going to use something called zero-shot classification, which lets us specify which labels to use for classification, so we don't have to rely on a pretrained model's labels.
- (The pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want.)

In the sentiment-analysis example earlier, we classified using 2 labels: positive and negative. But now I'll show you how to classify text using ANY set of labels of your choice.

In [6]:
classifier = pipeline("zero-shot-classification")
classifier(
    "History isn't really a fun class",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


{'sequence': "History isn't really a fun class",
 'labels': ['education', 'business', 'politics'],
 'scores': [0.7267491817474365, 0.19325977563858032, 0.07999099045991898]}
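Notice that the three scores above add up to 1. As far as I understand it, the zero-shot pipeline gets there by treating the task as natural language inference: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input text entails each hypothesis, and those scores are normalized across the labels. Here is a rough sketch of that idea. The template string is what I believe the default `hypothesis_template` to be, the logits are made up, and the helper names are my own, so treat it as an illustration and not the library's actual code.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Create one NLI hypothesis sentence per candidate label."""
    return [template.format(label) for label in labels]

def normalize(entailment_logits):
    """Softmax across the labels, so the scores sum to 1."""
    highest = max(entailment_logits)
    exps = [math.exp(x - highest) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["education", "politics", "business"]
hypotheses = build_hypotheses(labels)

# Made-up entailment logits, one per hypothesis; a real NLI model would produce these.
toy_logits = [1.5, -0.7, 0.2]
scores = normalize(toy_logits)

# Sort labels from most to least likely, like the pipeline output does.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

This is also why no fine-tuning on your own labels is needed: the model only has to judge whether one sentence follows from another.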

You can also choose a specific model for a specific task!

In [10]:
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M")
# Model names like this one come directly from the Hugging Face Hub.
# You can plug in any model from the Hub that supports the task.

# This is text generation: you provide a prompt, and the model auto-completes it by generating the remaining text,
# similar to the text predictions you see on your phone.
# Text generation involves randomness, so it is unlikely you will get the exact same response every time.

# Note: as the warning in the output below shows, the default max_new_tokens (=256) took precedence
# over max_length here, so the generated text runs well past 30 tokens.
# Passing max_new_tokens instead of max_length is the way to cap the output length.
generator(
    "So today, I just feel like",
    max_length=30,
    num_return_sequences=2,
)

Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "So today, I just feel like I need to make sure that I am getting all of my 2016 goals accomplished. \xa0I have my goals for 2016, but I didn't get everything accomplished. \xa0So, I want to make sure that I can see what I have accomplished.\n\nI need to get more into the habit of writing in my journal, I think that will help me get through a lot of things. \xa0I know that when I feel like I am not feeling good, I can just go into my journal and take a look at what I should be doing. \xa0It will help me see where I need to improve.\n\nI am going to post the 2016 goals that I have listed in this post. \xa0I have listed the goals for each day, and will have a few goals for each week and month.\n\nI want to be able to see these goals, so I can see where I am and where I need to improve. \xa0I know that if I don't see these, I will feel discouraged and will not be able to get through a lot of things.\n\nI want to be able to see how far I have come, so even if I didn't a
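The randomness in text generation comes from sampling: at each step the model assigns a score to every possible next token, and one token is drawn according to those scores instead of always taking the top one. Below is a small sketch of that single step in plain Python. The candidate tokens and their scores are made up, and `sample_next_token` is a hypothetical helper of mine, not part of the transformers library.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token with probability proportional to softmax(logits / temperature)."""
    scaled = [score / temperature for score in logits.values()]
    highest = max(scaled)
    weights = [math.exp(s - highest) for s in scaled]
    return rng.choices(list(logits.keys()), weights=weights, k=1)[0]

# Made-up scores for what might follow "So today, I just feel like"
toy_logits = {"I": 2.0, "going": 1.2, "a": 0.8, "sleeping": 0.1}

rng = random.Random(0)  # fixed seed so this run is repeatable
samples = [sample_next_token(toy_logits, rng=rng) for _ in range(5)]
print(samples)

# Greedy decoding always takes the top-scoring token, so it is not random.
greedy = max(toy_logits, key=toy_logits.get)
print(greedy)
```

Fixing the random seed is what makes sampled output repeatable. With the real library, the `set_seed` helper in transformers plays a similar role.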

There are SO many other tasks and examples I could show you, so I won't go over them exhaustively here.

## Combining data from multiple sources
One powerful application of Transformer models is their ability to combine and process data from multiple sources. This is especially useful when you need to:

- Search across multiple databases or repositories
- Consolidate information from different formats (text, images, audio)
- Create a unified view of related information

For example, you could build a system that:
- Searches for information across databases in multiple modalities like text and image.
- Combines results from different sources into a single coherent response. For example, from an audio file and text description.
- Presents the most relevant information from a database of documents and metadata.

This is an important aspect of transformers and is worth noting.
Next, we're going to get into the really good stuff, the nitty-gritty: how transformers ACTUALLY work, what goes on inside, and the transformer model architecture in general.

# Transformer Architecture


## Language models and the most famous examples:
- All the popular Transformer models (GPT, BERT, T5, etc.) have been trained as language models. 
- Language models are just models that have been trained on large amounts of raw text in a self-supervised fashion.
- (Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!)
- This type of model develops a statistical understanding of the language it has been trained on, but it’s less useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning or fine-tuning. During this process, the model is fine-tuned in a supervised way — that is, using human-annotated labels — on a given task.
- An example of a task is predicting the next word in a sentence having read the n previous words. This is called causal language modeling because the output depends on the past and present inputs, but not the future ones.
- Example below:

![Causal language modeling visualization](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/causal_modeling-dark.svg)
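To see why no human labelling is needed, here is a tiny sketch of how training examples for causal language modeling can be built straight from raw text: every prefix of the sentence becomes an input, and the word that follows it becomes the target. Splitting on whitespace is a simplification (real models work on tokens produced by a tokenizer), and `causal_lm_pairs` is a hypothetical helper name.

```python
def causal_lm_pairs(text):
    """Build (context, next_word) training pairs from raw text."""
    words = text.split()
    # The context only ever contains words that come BEFORE the target.
    return [(words[:i], words[i]) for i in range(1, len(words))]

pairs = causal_lm_pairs("my name is Sylvain")
for context, target in pairs:
    print(context, "->", target)
```

Each target is computed automatically from the input text itself, which is exactly what makes the objective self-supervised.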

- Additionally, the general strategy to achieve better performance is by increasing the models’ sizes as well as the amount of data they are pretrained on.
- Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources.
- This means sharing models and resources makes a lot of sense: it saves time, compute, and potentially money for everyone.

## Transfer Learning
### Pretraining and fine-tuning
- **Pretraining**: training a model from scratch. The weights are randomly initialized, and training starts with no prior knowledge.
  - Pretraining is usually done on a very large corpus of data, and training can take up to several weeks.
- **Fine-tuning**: training done AFTER a model has been pretrained. To perform fine-tuning, you first need a pretrained language model; you then perform additional training on a dataset specific to your task.
  - Now, this might seem confusing. Why not just train your model for your final use case from the start? Here are some reasons for doing it this way:
  - The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
  - Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.
  - For the same reason, the amount of time and resources needed to get good results are much lower.
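Here is a toy illustration of the reasons above, using a one-variable linear model in place of a transformer. The model is first "pretrained" on one task, then given a small training budget on a related task, and compared against a model trained from scratch with the same budget. Everything here is made up for illustration: the data, the learning rate, and the step counts. The weights start at zero so the run is repeatable, whereas real pretraining starts from random weights.

```python
def train(w, b, data, steps, lr=0.05):
    """Plain gradient descent on mean squared error for y = w * x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
pretrain_data = [(x, 2.0 * x + 1.0) for x in xs]  # the "general" task
finetune_data = [(x, 2.2 * x + 0.9) for x in xs]  # a related, "specific" task

# Pretraining: many steps, starting with no knowledge.
w_pre, b_pre = train(0.0, 0.0, pretrain_data, steps=500)

# Same small training budget for both approaches on the specific task.
budget = 5
w_ft, b_ft = train(w_pre, b_pre, finetune_data, steps=budget)  # fine-tuning
w_sc, b_sc = train(0.0, 0.0, finetune_data, steps=budget)      # from scratch

loss_finetuned = mse(w_ft, b_ft, finetune_data)
loss_scratch = mse(w_sc, b_sc, finetune_data)
print(f"fine-tuned: {loss_finetuned:.4f}  from scratch: {loss_scratch:.4f}")
```

With the same five steps, the fine-tuned model ends up much closer to the target because it starts from weights that already fit a similar task.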