# Hugging Face Transformers 101

This notebook summarizes how to use the [transformers](https://github.com/huggingface/transformers) library from [Hugging Face](https://huggingface.co/).

**Useful links:**
- [Hugging Face - Model Hub](https://huggingface.co/models)
- [Hugging Face - Datasets](https://huggingface.co/datasets)


# Overview

The transformers library by Hugging Face provides two main ways to use pre-trained models:
1. **Pipelines**
    - **Purpose**: Pipelines are designed for quick, easy, and high-level access to various NLP tasks like text classification, question answering, text generation, etc. They abstract away much of the complexity involved in setting up and using models and tokenizers.
    - **Ease of Use**: Pipelines are user-friendly and require minimal code to get started. You don’t need to worry about loading models or tokenizers separately.
    - **Flexibility**: Pipelines are less flexible since they are designed for specific tasks and operate within the constraints of the task-specific settings.
    - **Customization**: Limited customization options. The parameters and the way data flows through the pipeline are predefined.
    - **Ideal For**: Beginners, rapid prototyping, and tasks where you don’t need to fine-tune or customize the behavior of the models.


2. **AutoModel/AutoTokenizer Classes**
    - **Purpose**: These classes are more low-level and provide greater flexibility and control over the models and tokenizers. They allow you to load any pre-trained model or tokenizer from the model hub.
    - **Ease of Use**: Requires more setup compared to pipelines. You need to explicitly load the tokenizer and model and handle the inputs and outputs manually.
    - **Flexibility**: Highly flexible. You can customize almost every aspect of the model’s behavior, modify the data preprocessing, and tweak how the outputs are handled.
    - **Customization**: Extensive customization options. You can fine-tune models, change tokenization strategies, modify model architecture, or integrate with other libraries for advanced use cases.
    - **Ideal For**: Advanced users, research, fine-tuning models, and scenarios where you need to go beyond the default behavior of the pipelines.


## Pipelines

The [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines) function is the easiest and fastest way to use a pretrained model for inference. When no model is specified, the pipeline downloads and caches a default pretrained model and tokenizer for the chosen task, as in the sentiment analysis example further down. The table lists some of the supported tasks:

| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | pipeline(task="text-generation")              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | pipeline(task="vqa")                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | pipeline(task="document-question-answering")  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | pipeline(task="image-to-text")                |


1 - `sentiment-analysis` - List as an Input

In [1]:
from transformers import pipeline

# Create the classifier, then pass it a list of texts
classifier = pipeline("sentiment-analysis")
prompts = ["This is a very happy example :).",
           "We hope you don't hate it."]
results = classifier(prompts)
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
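The second prediction above has a score of only about 0.53, so the label is close to a coin flip. A minimal sketch of how such low-confidence results could be flagged is shown here. The `results` list copies the rounded output above, while `flag_uncertain` and the 0.6 threshold are hypothetical choices made for illustration, not part of the library.

```python
# Results copied (rounded) from the pipeline output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]

CONFIDENCE_THRESHOLD = 0.6  # arbitrary cut-off chosen for illustration


def flag_uncertain(results, threshold=CONFIDENCE_THRESHOLD):
    """Keep the label when the score reaches the threshold, else mark it 'UNCERTAIN'."""
    return [r["label"] if r["score"] >= threshold else "UNCERTAIN" for r in results]


flags = flag_uncertain(results)
print(flags)
```

The same function works on the list returned by `classifier(prompts)`, since each element is a dictionary with `label` and `score` keys.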


2 -  `automatic-speech-recognition` - Dataset as an Input

In [2]:
from datasets import load_dataset, Audio

# A pipeline can also run over samples taken from a dataset.
# Let's use automatic speech recognition.
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
# Let's load the dataset
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# Make sure the dataset's sampling rate matches the one the model was trained with
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))
# Transcribe the first four audio samples and print the text
result = speech_recognizer(dataset[:4]["audio"])
print([d["text"] for d in result])


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho


The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y



['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT']


3 - `multilingual-uncased-sentiment` - Selecting a specific model and tokenizer in the pipeline

Use `AutoModelForSequenceClassification` and `AutoTokenizer` to load the pretrained model and its associated tokenizer.

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Select the model name
model_name = "nlptown/bert-base-multilingual-uncased-sentiment" # predicts the sentiment of the review as a number of stars (between 1 and 5).
# Use AutoModelForSequenceClassification and AutoTokenizer from the named model
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Define the pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
# Run the pipeline
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")

[{'label': '5 stars', 'score': 0.7272652387619019}]

## AutoClass
An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path.

### AutoTokenizer

A tokenizer returns a dictionary containing:

- `input_ids`: numerical representations of your tokens.
- `attention_mask`: indicates which tokens should be attended to.

In [4]:
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
# The tokenizer returns a dictionary containing: input_ids & attention_mask
print("encoding -> input_ids: ", encoding["input_ids"])
print("encoding -> attention_mask: ", encoding["attention_mask"])

encoding -> input_ids:  [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102]
encoding -> attention_mask:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [5]:
# The tokenizer also accepts a list of inputs, and can pad and truncate them to return a uniform batch
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print("pt_batch -> input_ids: ", pt_batch["input_ids"])
print("pt_batch -> attention_mask: ", pt_batch["attention_mask"])

pt_batch -> input_ids:  tensor([[  101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103,   100,
         58263, 13299,   119,   102],
        [  101, 11312, 18763, 10855, 11530,   112,   162, 39487, 10197,   119,
           102,     0,     0,     0]])
pt_batch -> attention_mask:  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])
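To make the padding behaviour concrete, here is a library-free sketch of how padding to the longest sequence produces the `attention_mask` shown above. The helper `pad_batch` is hypothetical and only mimics the effect of `padding=True`. The token ids are copied from the output above, and the pad id of 0 is taken from that same output.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)        # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids copied from the tokenizer output above (14 and 11 tokens).
first = [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102]
second = [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence receives three pad tokens, and its mask ends in three zeros so the model ignores those positions.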


### AutoModel
It is important to select the AutoModel class that matches the desired [task](https://huggingface.co/docs/transformers/main/en/task_summary). The model outputs its final activations in the `logits` attribute, so applying the softmax function to the `logits` retrieves the probabilities.

In [6]:
from transformers import AutoModelForSequenceClassification
from torch import nn

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Pass the preprocessed batch of inputs directly to the model by unpacking the dictionary with **
pt_outputs = pt_model(**pt_batch)

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)

tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)


All 🤗 Transformers models output the tensors before the final activation function (like softmax) because the final activation function is often fused with the loss.
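As a rough illustration of what the softmax step does to the logits, the following library-free sketch converts a vector of raw scores into probabilities and picks the most likely class. The logit values are made up for the example, and mapping class index `i` to `i + 1` stars is an assumption based on the 1 to 5 star labels of the model used above.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-2.1, -2.3, -0.4, 2.5, 3.8]  # hypothetical raw scores for 5 classes
probs = softmax(logits)

# Assumption: class index 0 corresponds to 1 star, index 4 to 5 stars.
stars = probs.index(max(probs)) + 1
print([round(p, 4) for p in probs], stars)
```

Softmax preserves the ordering of the logits, so when only the predicted class is needed, taking the argmax of the logits directly gives the same answer.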


### Save a Model

1. `save_pretrained` to save the model

In [7]:
# Once the model is fine-tuned you can save it
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

2. `from_pretrained` to load the model

In [8]:
# Then you can reload it
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")

## Custom Model Builds
You can modify the model's configuration class to change how a model is built.

1. `AutoConfig` loads the pretrained model's configuration; pass the attributes you want to change as keyword arguments

In [9]:
from transformers import AutoConfig
my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12)

2. `AutoModel` to create a new model with the custom config

In [10]:
from transformers import AutoModel
my_model = AutoModel.from_config(my_config)

## Trainer (PyTorch)

- All models are a standard `torch.nn.Module`, so you can use them in any typical training loop
- Transformers provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for PyTorch
     - Contains the basic training loop and adds functionality such as distributed training and mixed precision
     - You can also write your own training loop

1.  **Model** (`PreTrainedModel` or `torch.nn.Module`)

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2. **Training Arguments** (hyperparameters)

In [12]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="path/to/save/folder/",
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  num_train_epochs=2,
)
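To get a feel for what these hyperparameters imply, the arithmetic below estimates the number of optimizer steps. It is a back-of-the-envelope sketch: the Rotten Tomatoes train split used in this notebook has 8530 examples, and the single device and absence of gradient accumulation are assumptions that are not set in the arguments above.

```python
import math

num_train_examples = 8530          # size of the Rotten Tomatoes train split
per_device_train_batch_size = 8    # from TrainingArguments above
num_train_epochs = 2               # from TrainingArguments above
num_devices = 1                    # assumption: a single GPU or CPU
gradient_accumulation_steps = 1    # assumption: no gradient accumulation

# Examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# The last, partially filled batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions one epoch takes 1067 steps and the full run takes 2134, which is useful when choosing logging, evaluation, or warmup intervals.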

3. **Preprocessing class** (tokenizer, image processor, feature extractor, or processor)

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

4. **Load a dataset**

In [14]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")

5. **Tokenize the dataset**

In [15]:
# Create a function to tokenize the dataset
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])

# Apply it to the entire dataset
dataset = dataset.map(tokenize_dataset, batched=True)

6. **Data Collator with Padding** (creates batches of examples from your dataset)

In [16]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
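The benefit of padding at collation time is that each batch is padded only to its own longest example rather than to the longest example in the whole dataset. The sketch below counts tokens under both strategies without using the library. The sequence lengths are invented and `padded_token_count` is a hypothetical helper written for this illustration.

```python
def padded_token_count(lengths, batch_size, dynamic):
    """Total tokens fed to the model after padding, batch by batch."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding targets the longest example in this batch only.
        target = max(batch) if dynamic else global_max
        total += target * len(batch)
    return total


lengths = [12, 9, 30, 28, 7, 8]  # hypothetical token counts per example

static_total = padded_token_count(lengths, batch_size=2, dynamic=False)
dynamic_total = padded_token_count(lengths, batch_size=2, dynamic=True)
print(static_total, dynamic_total)
```

In this toy case dynamic padding processes 100 tokens instead of 180, because the short examples are no longer stretched to length 30.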

7. **Trainer** (gathers all of the above; call `train()` to start training)

In [17]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [18]:
trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:


Abort: 
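The run above stopped at the Weights & Biases login prompt. One way to avoid the prompt, sketched here, is to disable W&B through an environment variable before the `Trainer` is created. This relies on the `WANDB_MODE` variable documented by the `wandb` package, so treat it as an assumption to verify against the installed version. On the Transformers side, passing `report_to="none"` to `TrainingArguments` is an alternative that turns off all logging integrations.

```python
import os

# Assumption: wandb reads WANDB_MODE at start-up; "disabled" turns logging into a no-op.
# Set this before constructing the Trainer so the login prompt is never triggered.
os.environ["WANDB_MODE"] = "disabled"

print(os.environ["WANDB_MODE"])
```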



# Examples of Different Tasks

## Sequence Classification

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The sun sets behind the mountains, painting the sky in shades of orange and pink"
sequence_1 = "The city buzzed with life as the night markets opened, filling the streets with vibrant colors and delicious aromas"
sequence_2 = "The sky turns a blend of orange and pink as the sun dips below the mountain peaks."

# The tokenizer will automatically add any model specific separators, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

print("This should be paraphrase")
for i in range(len(classes)):
    print(f"-> {classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
    
print("This should not be paraphrase")
for i in range(len(classes)):
    print(f"-> {classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

This should be paraphrase
-> not paraphrase: 8%
-> is paraphrase: 92%
This should not be paraphrase
-> not paraphrase: 94%
-> is paraphrase: 6%
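When only the final decision is needed, the probabilities can be reduced to a single label by taking the index of the largest value. The sketch below does this with the rounded percentages printed above. The `classes` ordering is the one assumed by this notebook; the model's `config.id2label` mapping, when it is populated, is a safer source for label names.

```python
classes = ["not paraphrase", "is paraphrase"]  # ordering assumed by this notebook

# Rounded probabilities copied from the output above.
paraphrase_results = [0.08, 0.92]
not_paraphrase_results = [0.94, 0.06]


def predict_label(probabilities, labels):
    """Return the label whose probability is highest."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best_index]


first_prediction = predict_label(paraphrase_results, classes)
second_prediction = predict_label(not_paraphrase_results, classes)
print(first_prediction, "|", second_prediction)
```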


## Extractive Question Answering

In [6]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Load pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

# Define a new context text
text = r"""
Python is a versatile programming language that supports multiple programming paradigms, including procedural,
object-oriented, and functional programming. Python's extensive standard library and its dynamic nature make it
a popular choice for both beginners and experienced developers. It is widely used in web development, data science,
automation, and scientific computing.
"""

# Define a new set of questions based on the context
questions = [
    "What programming paradigms does Python support?",
    "Why is Python popular among developers?",
    "In what fields is Python widely used?",
]

# Iterate over each question and find the answer
for question in questions:
    # Tokenize the question and context together
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    # Pass the tokenized inputs to the model
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Identify the start and end of the answer
    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    # Convert the token IDs to a string
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    # Print the question and the corresponding answer
    print(f"Question: {question}")
    print(f"Answer: {answer}")


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Question: What programming paradigms does Python support?
Answer: procedural, object - oriented, and functional programming
Question: Why is Python popular among developers?
Answer: extensive standard library and its dynamic nature
Question: In what fields is Python widely used?
Answer: web development, data science, automation, and scientific computing
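The loop above takes the argmax of the start and end scores independently, which can in principle select an end position that comes before the start and produce an empty answer. A common remedy, sketched below without the library, is to search for the highest-scoring pair with `start <= end`. The tokens and scores are invented to trigger the problem, and `best_span` and its `max_len` limit are hypothetical names for this illustration.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return the (start, end) pair, end inclusive, with the highest summed score and start <= end."""
    best_start, best_end, best_score = None, None, float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score
    return best_start, best_end


tokens = ["[CLS]", "where", "?", "[SEP]", "in", "web", "development", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.3, 2.5, 0.4, 0.0]  # made-up logits
end_scores = [0.1, 0.0, 0.0, 0.0, 3.0, 0.2, 2.8, 0.0]    # made-up logits

# Independent argmax: start is token 5 but end is token 4, so the slice is empty.
naive_start = start_scores.index(max(start_scores))
naive_end = end_scores.index(max(end_scores))
naive_answer = tokens[naive_start:naive_end + 1]

# Joint search keeps the span valid.
start, end = best_span(start_scores, end_scores)
answer = tokens[start:end + 1]
print(naive_answer, answer)
```

The joint search costs more than two argmax calls, but the `max_len` cap keeps it cheap for typical context lengths.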


## Masked Language Modeling

## Causal Language Modeling

## Text Generation

## Named Entity Recognition

## Summarization

## Translation

## Audio Classification

## Automatic Speech Recognition

## Image Classification