# Hugging Face Transformers 101

This notebook gives a broad overview of how to use the [transformers](https://github.com/huggingface/transformers) library from [Hugging Face](https://huggingface.co/).

**Useful links:**
- [Hugging Face - Model Hub](https://huggingface.co/models)
- [Hugging Face - Datasets](https://huggingface.co/datasets)


In [1]:
pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.44.0-py3-none-any.whl.metadata (43 kB)
Downloading transformers-4.44.0-py3-none-any.whl (9.5 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.42.3
    Uninstalling transformers-4.42.3:
      Successfully uninstalled transformers-4.42.3
Successfully installed transformers-4.44.0
Note: you may need to restart the kernel to use updated packages.


# Overview

The transformers library by Hugging Face provides two main ways to use pre-trained models:
1. **Pipelines**
    - **Purpose**: Pipelines are designed for quick, easy, and high-level access to various NLP tasks like text classification, question answering, text generation, etc. They abstract away much of the complexity involved in setting up and using models and tokenizers.
    - **Ease of Use**: Pipelines are user-friendly and require minimal code to get started. You don’t need to worry about loading models or tokenizers separately.
    - **Flexibility**: Pipelines are less flexible since they are designed for specific tasks and operate within the constraints of the task-specific settings.
    - **Customization**: Limited customization options. The parameters and the way data flows through the pipeline are predefined.
    - **Ideal For**: Beginners, rapid prototyping, and tasks where you don’t need to fine-tune or customize the behavior of the models.


2. **AutoModel/AutoTokenizer Classes**
    - **Purpose**: These classes are more low-level and provide greater flexibility and control over the models and tokenizers. They allow you to load any pre-trained model or tokenizer from the model hub.
    - **Ease of Use**: Requires more setup compared to pipelines. You need to explicitly load the tokenizer and model and handle the inputs and outputs manually.
    - **Flexibility**: Highly flexible. You can customize almost every aspect of the model’s behavior, modify the data preprocessing, and tweak how the outputs are handled.
    - **Customization**: Extensive customization options. You can fine-tune models, change tokenization strategies, modify model architecture, or integrate with other libraries for advanced use cases.
    - **Ideal For**: Advanced users, research, fine-tuning models, and scenarios where you need to go beyond the default behavior of the pipelines.


## Pipelines

The [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines) is the easiest and fastest way to use a pretrained model for inference. If no model is specified, the pipeline downloads and caches a default pretrained model and tokenizer for the chosen task. The table below lists some of the supported tasks:

| **Task**                     | **Description**                                                                                              | **Modality**    | **Pipeline identifier**                       |
|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------|
| Text classification          | assign a label to a given sequence of text                                                                   | NLP             | pipeline(task="sentiment-analysis")           |
| Text generation              | generate text given a prompt                                                                                 | NLP             | pipeline(task="text-generation")              |
| Summarization                | generate a summary of a sequence of text or document                                                         | NLP             | pipeline(task="summarization")                |
| Image classification         | assign a label to an image                                                                                   | Computer vision | pipeline(task="image-classification")         |
| Image segmentation           | assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation) | Computer vision | pipeline(task="image-segmentation")           |
| Object detection             | predict the bounding boxes and classes of objects in an image                                                | Computer vision | pipeline(task="object-detection")             |
| Audio classification         | assign a label to some audio data                                                                            | Audio           | pipeline(task="audio-classification")         |
| Automatic speech recognition | transcribe speech into text                                                                                  | Audio           | pipeline(task="automatic-speech-recognition") |
| Visual question answering    | answer a question about the image, given an image and a question                                             | Multimodal      | pipeline(task="vqa")                          |
| Document question answering  | answer a question about a document, given an image and a question                                            | Multimodal      | pipeline(task="document-question-answering")  |
| Image captioning             | generate a caption for a given image                                                                         | Multimodal      | pipeline(task="image-to-text")                |


1. `sentiment-analysis`: a list as input

In [2]:
from transformers import pipeline

# Create the classifier and pass it a list of prompts
classifier = pipeline("sentiment-analysis")
prompts = ["This is a very happy example :).",
           "We hope you don't hate it."]
results = classifier(prompts)
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309


2. `automatic-speech-recognition`: a dataset as input

In [3]:
from datasets import load_dataset, Audio

# A pipeline can also process an entire dataset
# Let's use automatic speech recognition
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
# Let's load the dataset
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
# Make sure the dataset's sampling rate matches the rate the model was trained on
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))
# Transcribe the first four audio samples and print the text
result = speech_recognizer(dataset[:4]["audio"])
print([d["text"] for d in result])

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT']


3. `multilingual-uncased-sentiment`: selecting a specific model and tokenizer in the pipeline

Use `AutoModelForSequenceClassification` and `AutoTokenizer` to load the pretrained model and its associated tokenizer.

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Select the model name
model_name = "nlptown/bert-base-multilingual-uncased-sentiment" # predicts the sentiment of the review as a number of stars (between 1 and 5).
# Use AutoModelForSequenceClassification and AutoTokenizer from the named model
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Define the pipeline
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
# Run the pipeline
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': '5 stars', 'score': 0.7272651791572571}]

## AutoClass
An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path.

### AutoTokenizer

The tokenizer returns a dictionary containing:

- `input_ids`: numerical representations of your tokens.
- `attention_mask`: indicates which tokens should be attended to.

In [5]:
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
# The tokenizer returns a dictionary containing: input_ids & attention_mask
print("encoding -> input_ids: ", encoding["input_ids"])
print("encoding -> attention_mask: ", encoding["attention_mask"])

encoding -> input_ids:  [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102]
encoding -> attention_mask:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [6]:
# The tokenizer also accepts a list of inputs and can pad and truncate them to a uniform length
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print("pt_batch -> input_ids: ", pt_batch["input_ids"])
print("pt_batch -> attention_mask: ", pt_batch["attention_mask"])

pt_batch -> input_ids:  tensor([[  101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103,   100,
         58263, 13299,   119,   102],
        [  101, 11312, 18763, 10855, 11530,   112,   162, 39487, 10197,   119,
           102,     0,     0,     0]])
pt_batch -> attention_mask:  tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])
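
The padding behaviour shown above can be mimicked in a few lines of plain Python. This is only an illustrative sketch, not the library's implementation: the `pad_batch` helper is hypothetical, and it assumes right-padding with a pad token ID of 0, which is consistent with the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above
first = [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102]
second = [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence gains three pad IDs, and its attention mask ends in three zeros so that those positions are ignored.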


### AutoModel
It is important to select the `AutoModel` class that matches the desired [task](https://huggingface.co/docs/transformers/main/en/task_summary). The model returns its raw, pre-softmax scores in the `logits` attribute, so applying the softmax function to the `logits` yields the probabilities.

In [7]:
from transformers import AutoModelForSequenceClassification
from torch import nn

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Pass the preprocessed batch of inputs directly to the model by unpacking the dictionary with **
pt_outputs = pt_model(**pt_batch)

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
print(pt_predictions)

tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)


All 🤗 Transformers models output the tensors before the final activation function (like softmax) because the final activation function is often fused with the loss.
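
To make the relationship between logits and probabilities concrete, here is a library-free sketch of the softmax function. The logit values are made up for illustration and are not produced by any model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing without changing the result
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-2.0, -1.5, 0.0, 2.5, 4.0]  # hypothetical scores for 1 to 5 stars
probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 4) for p in probs])
print(f"predicted: {predicted_class + 1} stars")
```

Because softmax preserves ordering, the argmax of the logits and the argmax of the probabilities pick the same class; softmax is only needed when you want the scores expressed as probabilities.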


### Save a Model

1. `save_pretrained` to save the model

In [8]:
# Once the model is fine-tuned you can save it
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

2. `from_pretrained` to load the model

In [9]:
# Then you can reload it
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")

## Custom Model Builds
You can modify the model's configuration class to change how a model is built.

1. `AutoConfig` loads the pretrained model's configuration; pass the attributes you want to change as keyword arguments.

In [10]:
from transformers import AutoConfig
my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12)

2. `AutoModel` to create a new model with the custom config

In [11]:
from transformers import AutoModel
my_model = AutoModel.from_config(my_config)
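
When changing attention-related settings, it helps to check that the numbers are compatible before building the model. The sketch below assumes, as in DistilBERT's default configuration, a hidden size (`dim`) of 768 that is split evenly across the attention heads; the helper `head_size` is hypothetical and not part of the library.

```python
def head_size(dim, n_heads):
    """Return the per-head dimension, or raise if the split is uneven."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads


# Assumed hidden size of distilbert-base-uncased
DIM = 768

for n_heads in (8, 12, 16):
    print(n_heads, "heads ->", head_size(DIM, n_heads), "dimensions per head")
```

Keep in mind that a model created with `from_config` starts from randomly initialized weights, so it has to be trained before it produces useful outputs.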

## Trainer (PyTorch)

- All models are standard `torch.nn.Module` instances, so you can use them in any typical training loop.
- Transformers provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for PyTorch:
     - It contains the basic training loop and adds functionality such as distributed training and mixed precision.
     - You can also write your own training loop.

1.  **Model** (`PreTrainedModel` or `torch.nn.Module`)

In [12]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2. **Training Arguments** (hyperparameters)

In [13]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="path/to/save/folder/",
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=8,
                                  per_device_eval_batch_size=8,
                                  num_train_epochs=2,
)

3. **Preprocessing class** (tokenizer, image processor, feature extractor, or processor)

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

4. **Load a dataset**

In [15]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")

5. **Tokenize the dataset**

In [16]:
# Create a function to tokenize the dataset
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])

# Apply it to the entire dataset
dataset = dataset.map(tokenize_dataset, batched=True)
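
With `batched=True`, the mapped function receives a dictionary of columns, each holding a list of values, rather than a single example. The following sketch imitates that calling convention in plain Python; `toy_tokenize` and `map_batched` are hypothetical stand-ins (a whitespace splitter and a simplified loop), not the real tokenizer or `Dataset.map`.

```python
def toy_tokenize(batch):
    """Stand-in for the tokenizer: split on whitespace and count tokens."""
    tokens = [text.split() for text in batch["text"]]
    return {"tokens": tokens, "length": [len(t) for t in tokens]}


def map_batched(columns, fn, batch_size=2):
    """Apply fn to slices of the columns and merge the new columns back in."""
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result


data = {"text": ["a gripping film", "dull", "funny and warm"], "label": [1, 0, 1]}
mapped = map_batched(data, toy_tokenize)
print(mapped["length"])
```

Processing many rows per call is what lets a fast tokenizer handle them together instead of one at a time.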

6. **Data Collator with Padding** (to create a batch of examples from your dataset)

In [17]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

7. **Trainer** (brings all of the above together; call `train()` to start training)

In [18]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [19]:
trainer.train()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc




| Step | Training Loss |
|------|---------------|
| 500  | 0.4174        |
| 1000 | 0.2496        |




TrainOutput(global_step=1068, training_loss=0.3269694476538383, metrics={'train_runtime': 146.3854, 'train_samples_per_second': 116.542, 'train_steps_per_second': 7.296, 'total_flos': 215637261882480.0, 'train_loss': 0.3269694476538383, 'epoch': 2.0})
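
The `global_step` value in the output can be cross-checked with a little arithmetic. The sketch below assumes the run used two devices, which is not stated in the notebook but is the device count for which the numbers line up; the training split size (8530) and the other settings are taken from the cells above.

```python
import math

num_examples = 8530        # size of the rotten_tomatoes train split
per_device_batch_size = 8  # per_device_train_batch_size
num_epochs = 2             # num_train_epochs
num_devices = 2            # assumption: inferred from the reported step count

effective_batch_size = per_device_batch_size * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
```

With a single device, the same settings would give 1067 steps per epoch, so the reported step count is a quick way to see how many devices a run actually used.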

# Examples of Different Tasks

## Sequence Classification

In [20]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The sun sets behind the mountains, painting the sky in shades of orange and pink"
sequence_1 = "The city buzzed with life as the night markets opened, filling the streets with vibrant colors and delicious aromas"
sequence_2 = "The sky turns a blend of orange and pink as the sun dips below the mountain peaks."

# The tokenizer will automatically add any model specific separators, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

print("This should be paraphrase")
for i in range(len(classes)):
    print(f"-> {classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
    
print("This should not be paraphrase")
for i in range(len(classes)):
    print(f"-> {classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

This should be paraphrase
-> not paraphrase: 8%
-> is paraphrase: 92%
This should not be paraphrase
-> not paraphrase: 94%
-> is paraphrase: 6%
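
For sentence-pair tasks like this one, BERT-style tokenizers join both sequences into a single input. The sketch below shows that layout with a hypothetical whitespace tokenizer; real tokenizers split text into subwords and return integer IDs, so treat this only as a picture of the structure.

```python
def encode_pair(text_a, text_b):
    """Build a BERT-style pair: [CLS] A [SEP] B [SEP], plus segment IDs."""
    tokens_a = text_a.split()
    tokens_b = text_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A and its [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = encode_pair("The sun sets", "The sun dips low")
for token, segment in zip(tokens, token_type_ids):
    print(f"{segment}  {token}")
```

The segment IDs (returned by the real tokenizer as `token_type_ids`) are how the model tells the first sequence from the second.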


## Extractive Question Answering

In [21]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Load pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

# Define a new context text
text = r"""
Python is a versatile programming language that supports multiple programming paradigms, including procedural,
object-oriented, and functional programming. Python's extensive standard library and its dynamic nature make it
a popular choice for both beginners and experienced developers. It is widely used in web development, data science,
automation, and scientific computing.
"""

# Define a new set of questions based on the context
questions = [
    "What programming paradigms does Python support?",
    "Why is Python popular among developers?",
    "In what fields is Python widely used?",
]

# Iterate over each question and find the answer
for question in questions:
    # Tokenize the question and context together
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    # Pass the tokenized inputs to the model
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Identify the start and end of the answer
    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    # Convert the token IDs to a string
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    # Print the question and the corresponding answer
    print(f"Question: {question}")
    print(f"Answer: {answer}")


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Question: What programming paradigms does Python support?
Answer: procedural, object - oriented, and functional programming
Question: Why is Python popular among developers?
Answer: extensive standard library and its dynamic nature
Question: In what fields is Python widely used?
Answer: web development, data science, automation, and scientific computing
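
Taking the two argmaxes independently, as above, is simple but can in principle yield an end position that comes before the start. A common refinement is to score every valid (start, end) pair jointly. Below is a minimal sketch of that idea with made-up scores; `best_span` and `max_answer_len` are hypothetical names, not library API.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end) maximizing start_score + end_score with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


tokens = ["[CLS]", "who", "?", "[SEP]", "ada", "love", "##lace", "wrote", "it", "[SEP]"]
# Made-up scores: the highest end score (index 2) lies before the highest start score (index 4)
start_scores = [0.1, 0.0, 0.0, 0.0, 5.0, 1.0, 0.2, 0.1, 0.0, 0.0]
end_scores = [0.1, 0.0, 6.0, 0.0, 1.0, 0.5, 4.5, 0.3, 0.0, 0.0]

start, end = best_span(start_scores, end_scores)
# Join WordPiece tokens: pieces starting with "##" continue the previous word
answer = " ".join(tokens[start:end + 1]).replace(" ##", "")
print(answer)
```

In this toy input, independent argmaxes would select start 4 and end 2, an empty span, while the joint search returns a valid one.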


## Masked Language Modeling

In [22]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

# Define a new sequence with a masked token
sequence = (
    "Artificial intelligence is a rapidly growing field that has the potential to "
    f"revolutionize {tokenizer.mask_token} in many industries."
)

# Tokenize the input sequence
inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Get the logits from the model
token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

# Get the top 5 token predictions for the masked token
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

# Replace the masked token with each of the top 5 predictions and print the results
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Artificial intelligence is a rapidly growing field that has the potential to revolutionize computing in many industries.
Artificial intelligence is a rapidly growing field that has the potential to revolutionize technology in many industries.
Artificial intelligence is a rapidly growing field that has the potential to revolutionize intelligence in many industries.
Artificial intelligence is a rapidly growing field that has the potential to revolutionize innovation in many industries.
Artificial intelligence is a rapidly growing field that has the potential to revolutionize science in many industries.
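
Selecting the top predictions boils down to ranking the vocabulary by logit. A minimal, library-free sketch of that step follows; the five-word vocabulary and its scores are invented for illustration.

```python
import heapq

# Hypothetical vocabulary and logits at the masked position
vocab = ["computing", "bananas", "technology", "science", "the"]
mask_logits = [7.2, -3.1, 6.8, 5.9, 0.4]

# Rank token indices by their logit and keep the k highest
k = 3
top_indices = heapq.nlargest(k, range(len(vocab)), key=lambda i: mask_logits[i])

template = "AI could revolutionize [MASK] in many industries."
for idx in top_indices:
    print(template.replace("[MASK]", vocab[idx]))
```

`torch.topk` plays the same role in the cell above, operating on the full vocabulary-sized logit vector.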


## Causal Language Modeling

In [23]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from torch import nn

# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Input sequence
sequence = "Hugging Face is based in DUMBO, New York City, and"

# Tokenize the input sequence
inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# Get the logits at the last position, i.e. the scores for the next token
next_token_logits = model(**inputs).logits[:, -1, :]

# Implement top-k and top-p filtering manually
def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
    assert logits.dim() == 1  # Ensure logits have the expected shape

    # Top-k filtering
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))  # Ensure top_k is within the vocabulary size
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    # Nucleus (top-p) filtering
    if top_p > 0.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(nn.functional.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above the threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = filter_value

    return logits

# Apply the filtering function
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits.squeeze(), top_k=50, top_p=1.0)

# Sample from the filtered logits
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

# Concatenate the sampled token to the input sequence
generated = torch.cat([input_ids, next_token.unsqueeze(0)], dim=-1)

# Decode the generated sequence into text
resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)


Hugging Face is based in DUMBO, New York City, and is
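
The shift step in the nucleus filter is easy to misread, so here is a small worked example in plain Python. It assumes a toy distribution over four tokens that is already sorted by probability; `nucleus_keep` is a hypothetical helper that mirrors the masking logic above rather than any library function.

```python
from itertools import accumulate


def nucleus_keep(sorted_probs, top_p):
    """Return a keep-mask for probabilities sorted in descending order."""
    cumulative = list(accumulate(sorted_probs))
    remove = [c > top_p for c in cumulative]
    # Shift right so the first token that crosses the threshold is still kept
    remove = [False] + remove[:-1]
    return [not r for r in remove]


sorted_probs = [0.5, 0.3, 0.15, 0.05]
keep = nucleus_keep(sorted_probs, top_p=0.9)
kept_mass = sum(p for p, k in zip(sorted_probs, keep) if k)

# Renormalize the surviving probabilities before sampling
renormalized = [p / kept_mass if k else 0.0 for p, k in zip(sorted_probs, keep)]
print(keep)
print([round(p, 3) for p in renormalized])
```

Note that the cell above calls the filter with `top_p=1.0`, a threshold the cumulative probability should practically never exceed, so in that run the top-k step does the real filtering.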


## Text Generation

In [24]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# XLNet often needs extra context to work well with short prompts, so a longer padding text is prepended
PADDING_TEXT = """In the year 2075, humanity has established colonies on Mars. The red planet is now home to
a thriving community of scientists, engineers, and families. The story begins with the discovery of an
ancient Martian artifact buried beneath the surface. Dr. Elara Quinn, a leading archaeologist, is called
to investigate the find. As she examines the artifact, she experiences strange visions and begins to
unravel the secrets of a long-lost Martian civilization. The discovery sets off a chain of events that
could change the course of human history. <eod> </s> <eos>"""

# Prompt
prompt = "The team of astronauts prepared for their mission, knowing that "

# Tokenize the combined PADDING_TEXT and prompt
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

# Character length of the decoded input (padding text + prompt), used below to strip it from the output
prompt_length = len(tokenizer.decode(inputs[0]))

# Generate the continuation of the text
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1:]

# Print the generated text
print(generated)


This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (-1). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


The team of astronauts prepared for their mission, knowing that they would be a leader, the last individual to be in the service of the United States will be able to do so in the first few months of the service. With a mission planned for the next 12 weeks, the team of astronauts have to consider what they can do for the mission. They will also need to plan for the long-term success of the mission, making sure that the company can sustain itself and their employees. They will also have to prepare for the very long-term commitment of the mission.<eop> The first three weeks of the mission have been planned for six months by NASA and will include several weeks of a
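
The `top_k=60` and `top_p=0.95` arguments above restrict which tokens can be sampled at each step. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `top_k_top_p_filter` and the toy probabilities are hypothetical names and values chosen for illustration. It keeps the `top_k` most likely tokens, renormalizes, then keeps the smallest prefix whose cumulative probability reaches `top_p`.

```python
def top_k_top_p_filter(probs, top_k=60, top_p=0.95):
    """Return a renormalized {token: prob} dict after top-k, then top-p filtering.

    `probs` is a hypothetical next-token distribution given as {token: probability}.
    """
    # Top-k: keep only the k most probable tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Renormalize so the surviving probabilities sum to 1
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize again over the tokens that remain
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy distribution (made-up numbers, not model output)
toy_probs = {"they": 0.5, "the": 0.3, "it": 0.15, "zebra": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)  # only "they" and "the" remain
```

Sampling then draws the next token from the filtered distribution, which removes the long tail of unlikely tokens while keeping some randomness.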


## Named Entity Recognition

In [25]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer
# (the checkpoint is a fine-tuned cased BERT, so the bert-base-cased tokenizer is used with it)
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Sequence for Named Entity Recognition (NER)
sequence = (
    "NASA's Jet Propulsion Laboratory (JPL) is a research and development center "
    "located in Pasadena, California. JPL is responsible for several high-profile space missions."
)

# Tokenize the input sequence
inputs = tokenizer(sequence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Get model outputs (no gradients are needed for inference)
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)

# Print tokens and their predicted labels
for token, prediction in zip(tokens, predictions[0].tolist()):
    print((token, model.config.id2label[prediction]))



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



('[CLS]', 'O')
('NASA', 'I-ORG')
("'", 'O')
('s', 'O')
('Jet', 'I-ORG')
('Pro', 'I-ORG')
('##pulsion', 'I-ORG')
('Laboratory', 'I-ORG')
('(', 'O')
('JP', 'I-ORG')
('##L', 'I-ORG')
(')', 'O')
('is', 'O')
('a', 'O')
('research', 'O')
('and', 'O')
('development', 'O')
('center', 'O')
('located', 'O')
('in', 'O')
('Pasadena', 'I-LOC')
(',', 'O')
('California', 'I-LOC')
('.', 'O')
('JP', 'I-ORG')
('##L', 'I-ORG')
('is', 'O')
('responsible', 'O')
('for', 'O')
('several', 'O')
('high', 'O')
('-', 'O')
('profile', 'O')
('space', 'O')
('missions', 'O')
('.', 'O')
('[SEP]', 'O')
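
The output above is per word piece, so "Propulsion" and "JPL" appear split into `##` fragments. As a rough sketch of how these could be merged into whole entities, the hypothetical helper `group_entities` below joins `##` pieces onto the previous word and groups consecutive tokens that share an entity type. It is an illustration only; the library's `pipeline("ner", aggregation_strategy="simple")` offers built-in grouping.

```python
def group_entities(token_label_pairs):
    """Merge (token, label) pairs into (entity_text, entity_type) tuples."""
    entities = []
    words, entity_type = [], None

    def flush():
        # Close the entity being built, if any
        nonlocal words, entity_type
        if words:
            entities.append((" ".join(words), entity_type))
        words, entity_type = [], None

    for token, label in token_label_pairs:
        if token.startswith("##"):
            # Word-piece continuation: glue it onto the previous word of an open entity
            if words:
                words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, _, label_type = label.partition("-")
        if words and label_type == entity_type and prefix != "B":
            words.append(token)  # same entity continues
        else:
            flush()
            words, entity_type = [token], label_type  # a new entity starts
    flush()
    return entities


# A subset of the (token, label) pairs printed above
pairs = [
    ("[CLS]", "O"), ("NASA", "I-ORG"), ("'", "O"), ("s", "O"),
    ("Jet", "I-ORG"), ("Pro", "I-ORG"), ("##pulsion", "I-ORG"), ("Laboratory", "I-ORG"),
    ("(", "O"), ("JP", "I-ORG"), ("##L", "I-ORG"), (")", "O"),
    ("in", "O"), ("Pasadena", "I-LOC"), (",", "O"), ("California", "I-LOC"),
    (".", "O"), ("[SEP]", "O"),
]
for text, entity_type in group_entities(pairs):
    print(entity_type, text)
```

On these pairs the helper yields NASA, Jet Propulsion Laboratory, and JPL as ORG, and Pasadena and California as separate LOC entities, because the comma between them is labeled `O`.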


## Summarization

In [26]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# ARTICLE
ARTICLE = """In 2023, scientists at the European Space Agency (ESA) achieved a significant milestone in space exploration with the successful deployment of the Euclid space telescope.
This telescope is designed to investigate the mysterious dark energy and dark matter that make up most of the universe's mass-energy content. Euclid will map the geometry of the dark universe
by measuring the shapes and distances of billions of galaxies. The mission aims to improve our understanding of the universe's expansion and the forces driving it. The Euclid mission represents
a collaborative effort involving international space agencies and research institutions. The telescope is equipped with state-of-the-art instruments to capture detailed images of distant galaxies
and analyze their distribution. By studying the large-scale structure of the universe, scientists hope to unlock new insights into fundamental questions about cosmic evolution and the nature of dark
energy. The launch of Euclid marks a major advancement in our quest to explore the cosmos and understand the underlying forces shaping our universe."""

# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Prepare the input for summarization
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary with beam search (passing the attention mask along with the input IDs)
outputs = model.generate(
    **inputs,
    max_length=150,
    min_length=40,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)

# Decode and print the summary
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


the telescope is designed to investigate the mysterious dark energy and dark matter that make up most of the universe's mass-energy content. the launch of Euclid marks a major advancement in our quest to explore the cosmos and understand the forces shaping our universe.
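
The `length_penalty=2.0` argument used above affects how beam search ranks finished candidates. According to the library documentation, the penalty is applied as an exponent to the sequence length, which then divides the sequence's log-likelihood. The sketch below illustrates that formula with made-up numbers; `beam_score` is a hypothetical helper, not a library function.

```python
def beam_score(sum_log_probs, length, length_penalty=1.0):
    """Score a finished beam: total log-probability divided by length ** length_penalty."""
    return sum_log_probs / (length ** length_penalty)


# Two hypothetical candidates: a short one and a longer one with lower total log-probability
short = {"sum_log_probs": -4.0, "length": 5}
long = {"sum_log_probs": -6.0, "length": 10}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short["sum_log_probs"], short["length"], penalty)
    l = beam_score(long["sum_log_probs"], long["length"], penalty)
    winner = "short" if s > l else "long"
    print(f"length_penalty={penalty}: short={s:.3f}, long={l:.3f}, preferred={winner}")
```

Because log-likelihoods are negative, dividing by a larger power of the length moves the score toward zero, so values above 0 favor longer summaries and values below 0 favor shorter ones.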


## Translation

In [27]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Text to translate; the "translate English to German:" prefix tells T5 which task to perform
inputs = tokenizer(
    "translate English to German: The quick brown fox jumps over the lazy dog",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

# Print the translated text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Der schnelle braune Fuchs springt über den faulen Hund


## Image Classification

In [28]:
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from datasets import load_dataset

# Load dataset and select an image
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# Load the image processor and model
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Preprocess the image (resize, normalize, convert to a tensor)
inputs = image_processor(image, return_tensors="pt")

# Perform inference
with torch.no_grad():
    logits = model(**inputs).logits

# Get and print the predicted label
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])

The repository for huggingface/cats-image contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/huggingface/cats-image.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.


Egyptian cat
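
The cell above reports only the single best label via `argmax`. To see how confident the model is, the logits can be turned into probabilities with a softmax and the top few labels listed. The sketch below shows this in plain Python; the logits and the four-entry `id2label` mapping are hypothetical stand-ins for `model(**inputs).logits` and `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (subtracting the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical values for illustration only
toy_id2label = {0: "Egyptian cat", 1: "tabby", 2: "tiger cat", 3: "lynx"}
toy_logits = [4.0, 3.0, 2.5, 0.5]

for label, prob in top_k_labels(toy_logits, toy_id2label, k=3):
    print(f"{label}: {prob:.3f}")
```

With real model output, the same idea applies after converting the logits tensor to a list, for example with `logits[0].tolist()`.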
