[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MariaAise/dl_intro/blob/main/codebook/day1/intro_transformer.ipynb)

Welcome to Hugging Face

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

Hugging Face provides `pipelines`: ready-made shortcuts for common tasks.

Each pipeline is already connected to a **pretrained model**.

Hugging Face maps common ML tasks to `pipeline` **tasks**, such as `sentiment-analysis`. This lets you try a task without first choosing a particular model: the pipeline runs it on a default model for that task.

You can view the list of all tasks available via the pipeline at [Link](https://huggingface.co/docs/transformers/en/main_classes/pipelines)

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("Social media is the perfect place to feel connected… to complete strangers I’ll never meet.”")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9992805123329163}]

In [None]:
classifier(
    ["I act as if I don’t care whether people like me. Deep down, I actually enjoy it.", "I find social interactions draining — like sunlight, but worse."]
)

[{'label': 'POSITIVE', 'score': 0.9900863766670227},
 {'label': 'NEGATIVE', 'score': 0.9981635212898254}]

If we want to use a different model, we can easily swap it:

In [None]:
# Example: a RoBERTa model fine-tuned for sentiment
classifier_roberta = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment"
)

print("Custom model:", classifier_roberta.model.name_or_path)
result2 = classifier_roberta("Social media is the perfect place to feel connected… to complete strangers I’ll never meet.")
print(result2)


Device set to use cpu


Custom model: cardiffnlp/twitter-roberta-base-sentiment
[{'label': 'LABEL_2', 'score': 0.8391926884651184}]
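
The `LABEL_2` output is not self-explanatory. According to the model card for `cardiffnlp/twitter-roberta-base-sentiment`, the label indices stand for negative (0), neutral (1), and positive (2). Below is a minimal sketch of translating the raw labels under that assumption. `LABEL_NAMES` and `readable` are hypothetical helper names, and the result is copied from the output above rather than recomputed.

```python
# Mapping assumed from the model card: 0 = negative, 1 = neutral, 2 = positive.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def readable(results):
    """Return a copy of the pipeline results with human-readable labels."""
    return [{**r, "label": LABEL_NAMES.get(r["label"], r["label"])} for r in results]

# Result copied from the cell output above.
result2 = [{"label": "LABEL_2", "score": 0.8391926884651184}]
print(readable(result2))
```

Under this mapping the RoBERTa model agrees with the default pipeline that the sentence is positive, although with a lower score.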


With zero-shot classification, the model can handle new labels it hasn’t been explicitly trained on.

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "A furry friend that loves chasing balls and wagging its tail.",
    candidate_labels=["dog", "cat", "hamster"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


{'sequence': 'A furry friend that loves chasing balls and wagging its tail.',
 'labels': ['dog', 'cat', 'hamster'],
 'scores': [0.7752008438110352, 0.18641658127307892, 0.038382548838853836]}
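
In the result, `labels` and `scores` are aligned lists sorted from most to least likely, and in the default single-label mode the scores sum to roughly 1. The sketch below pairs them up and applies a cut-off. The `THRESHOLD` value is a hypothetical choice for illustration, and the result dictionary is copied from the output above.

```python
# Result copied from the zero-shot output above.
result = {
    "sequence": "A furry friend that loves chasing balls and wagging its tail.",
    "labels": ["dog", "cat", "hamster"],
    "scores": [0.7752008438110352, 0.18641658127307892, 0.038382548838853836],
}

# Pair each label with its score and pick the highest-scoring one.
ranked = list(zip(result["labels"], result["scores"]))
best_label, best_score = max(ranked, key=lambda pair: pair[1])

THRESHOLD = 0.5  # hypothetical cut-off; tune it for your own use case
prediction = best_label if best_score >= THRESHOLD else "uncertain"
print(prediction, round(best_score, 3))
```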

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("I packed my bags, left the city behind, and discovered that")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I packed my bags, left the city behind, and discovered that I had to go to an airport to transfer back to China. It wasn\'t easy.\n\nThen I got word from my wife that I had arrived in China on a train that had to be stopped somewhere else, and that I was going to have to pay a fine of about 5,000 yuan.\n\nI was told that I could go to any airport to get a ticket, but that I had to wait outside the luggage-pane for the ticket.\n\nI couldn\'t find a airport in China. I was forced to ask all my airport officials for help; I was told they were afraid that this would be a "bad experience", and that I would have to pay a fine of about 5,000 yuan.\n\nI received my pay notice at the airport, and was told that I would have to pay a fine of about 10,000 yuan.\n\nI had to wait outside the luggage-pane for the ticket to be accepted.\n\nI was told that I would have to pay a fine of about 10,000 yuan.\n\nThe flight was cancelled.\n\nAfter getting my refund request, I had to pay 

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "I packed my bags, left the city behind, and discovered that",
    max_length=30,
    num_return_sequences=2,
)


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'I packed my bags, left the city behind, and discovered that they had been shot by a man in the middle of nowhere in a neighborhood of about 3,000 square feet.\n\n\n\n\nHe said he had seen the man, a white man, lying in the street, in the middle of the street, in the middle of a street, and thought to himself, "Whew, I\'m going to see you," but he realized that he wasn\'t able to recognize the man.\nHe decided to search for the man and found what appeared to be a silver handgun, but the man was unable to locate it.\n"I thought it was a Silver handgun," he said.\nThe man\'s family members are suing the city of Omaha, Neb., for damages of up to $50,000, and the Omaha Police Department is asking for damages of up to $50,000, and the Omaha Police Department is asking for damages of up to $50,000.\nThe Omaha Police Department is asking for damages of up to $50,000, and the Omaha Police Department is asking for damages of up to $50,000, and the Omaha Police Department is 

Fill-mask (masked language modeling) is a popular task for model training:
- You give the model a sentence with a missing word.
- You mark the missing word with `<mask>`.
- The model predicts the most likely word(s) that fit.

However, this task is also useful for:

- Autocomplete & text editors → suggesting missing words.

- Search engines → guessing queries (“best restaurants in <mask>”).

- Data cleaning → detecting odd/missing words.

- Downstream tasks → the representations learned by MLM help in classification, question answering, etc.

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("Nothing beats a hot slice of <mask> after a long day.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


[{'score': 0.36462998390197754,
  'token': 9366,
  'token_str': ' pizza',
  'sequence': 'Nothing beats a hot slice of pizza after a long day.'},
 {'score': 0.10187476873397827,
  'token': 11637,
  'token_str': ' pie',
  'sequence': 'Nothing beats a hot slice of pie after a long day.'}]

In [None]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER', 'score': 0.99816, 'word': 'Sylvain', 'start': 11, 'end': 18}, 
 {'entity_group': 'ORG', 'score': 0.97960, 'word': 'Hugging Face', 'start': 33, 'end': 45}, 
 {'entity_group': 'LOC', 'score': 0.99321, 'word': 'Brooklyn', 'start': 49, 'end': 57}
]
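
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A short sketch using the entities from the output above (scores omitted for brevity):

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

for ent in entities:
    # end is exclusive, as in ordinary Python slicing
    span = text[ent["start"]:ent["end"]]
    print(ent["entity_group"], repr(span))
```

The question-answering pipeline reports `start` and `end` in the same way, relative to the `context` string.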

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The '
                  'number of engineering graduates in the U.S. has declined in '
                  'traditional engineering disciplines such as mechanical, civil '
                  ', electrical, chemical, and aeronautical engineering . Rapidly '
                  'developing economies such as China and India, as well as other '
                  'industrial countries in Europe and Asia, continue to encourage '
                  'and advance engineering .'}]

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

## What exactly is happening behind the `Pipeline`


In [None]:
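from transformers import pipeline

# Recreate the zero-shot pipeline so this cell runs on its own.
classifier = pipeline("zero-shot-classification")
classifier(
    "A furry friend that loves chasing balls and wagging its tail.",
    candidate_labels=["dog", "cat", "hamster"],
)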

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

In [None]:
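from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Same checkpoint, now loaded with its sequence-classification head.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)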

In [None]:
print(outputs.logits.shape)

In [None]:
print(outputs.logits)

In [None]:
import torch

# Convert raw logits to probabilities along the label dimension.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

In [None]:
model.config.id2label
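
The last two steps (softmax over the logits, then a lookup in `id2label`) can be sketched in plain Python. The logits below are hypothetical values chosen for illustration, not actual model output, and the `id2label` dictionary is an assumed mapping; check `model.config.id2label` for the real one.

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two sentences (illustrative, not model output).
logits = [[-1.5, 1.6], [4.2, -3.3]]
# Assumed label mapping; the real one lives in model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top score
    print(id2label[best], round(probs[best], 4))
```

Logits are unnormalized scores, which is why they can be negative or larger than 1; only after softmax can they be read as probabilities.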

## Working with datasets

Good data matters, especially early on while you are still learning to work with Python.

Hugging Face provides access to high-quality classic datasets that make learning smoother.

Data is available across modalities such as text, image, audio, and video, and Hugging Face provides convenient tools for working with it.


[DatasetInfo](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetInfo): provides general info about datasets and is a good place to start

Datasets can be large, so you may not want to download one before knowing what it contains. Use the `load_dataset_builder()` function to load a dataset builder and inspect a dataset’s attributes without downloading it:



In [None]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

ds_builder.info.description

ds_builder.info.features


{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

In [None]:
from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")


**Splits**

A split is a specific subset of a dataset, such as `train` or `test`. List a dataset’s split names with the `get_dataset_split_names()` function:

In [None]:
from datasets import get_dataset_split_names

get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes")

['train', 'validation', 'test']

Then you can load a specific split with the split parameter. Loading a dataset split returns a Dataset object:

In [None]:
from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

If you don’t specify a split, 🤗 Datasets returns a DatasetDict object instead:



In [None]:
from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")

Some datasets contain several sub-datasets. For example, the `MInDS-14` dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as configurations or subsets, and you must explicitly select one when loading the dataset. If you don’t provide a configuration name, 🤗 Datasets will raise a ValueError and remind you to choose a configuration.

Use the `get_dataset_config_names()` function to retrieve a list of all the possible configurations available to your dataset:

In [None]:
from datasets import get_dataset_config_names

configs = get_dataset_config_names("PolyAI/minds14")
print(configs)

In [None]:
from datasets import load_dataset

mindsFR = load_dataset("PolyAI/minds14", "fr-FR", split="train")


### Iterable vs regular datasets

There are two types of dataset objects, a **regular Dataset** and then an ✨ **IterableDataset** ✨.

A Dataset provides fast **random** access to the rows, and memory-mapping so that loading even large datasets only uses a relatively small amount of device memory. But for really, really big datasets that won’t even fit on disk or in memory, an IterableDataset allows you to access and use the dataset without waiting for it to download completely.
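
The difference is loosely analogous to a Python list versus a generator. The sketch below is plain Python rather than the 🤗 Datasets API; `rows` and `stream` are hypothetical stand-ins used only to illustrate random access versus sequential access.

```python
from itertools import islice

rows = [{"text": f"review {i}", "label": i % 2} for i in range(5)]

# Like a regular Dataset: any position can be read directly.
print(rows[3])

def stream(source):
    """Yield rows one at a time, the way a streamed dataset is consumed."""
    for row in source:
        yield row

streamed = stream(rows)
# No random access here: streamed[3] would raise a TypeError.
first_two = list(islice(streamed, 2))  # take the first two rows only
print(first_two)
```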

### Dataset

When you load a dataset split, you’ll get a `Dataset` object.

In [None]:
from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

**Indexing**

A Dataset contains columns of data, and each column can be a different type of data. The **index**, or **axis label**, is used to access examples from the dataset.

In [None]:
dataset[0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

To access the last row, use a negative index:


In [None]:
dataset[-1]

{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .',
 'label': 0}


Indexing by the column name returns all the values in that column:

In [None]:
dataset["text"]

You can combine row and column name indexing to return a specific value at a position:

In [None]:
dataset[0]["text"]

'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

**Slicing**

Slicing returns a slice - or subset - of the dataset, which is useful for viewing several rows at once. To slice a dataset, use the : operator to specify a range of positions.

In [None]:
dataset[:3]

dataset[3:6]

{'text': ['if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .'],
 'label': [1, 1, 1]}
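
Note that a slice comes back column-oriented: one dictionary whose values are lists, not a list of rows. If you prefer one dictionary per row, the sketch below shows one way to convert. The `batch` here is a shortened stand-in with made-up texts, not the actual output above.

```python
# Shortened stand-in for a sliced result (texts are made up for illustration).
batch = {
    "text": ["first review", "second review", "third review"],
    "label": [1, 1, 1],
}

# zip(*batch.values()) walks the columns in parallel, one row at a time;
# zip(batch, values) pairs each column name with that row's value.
rows = [dict(zip(batch, values)) for values in zip(*batch.values())]
for row in rows:
    print(row)
```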

### IterableDataset

An IterableDataset is loaded when you set the streaming parameter to True in load_dataset():



In [None]:
from datasets import load_dataset

iterable_dataset = load_dataset("ethz/food101", split="train", streaming=True)
for example in iterable_dataset:
    print(example)
    break


{'image': <PIL.Image.Image image mode=RGB size=384x512 at 0x7C1C22772D50>, 'label': 6}


An IterableDataset’s behavior is different from a regular Dataset. **You don’t get random access to examples in an IterableDataset.** Instead, you should iterate over its elements, for example, by calling next(iter()) or with a for loop to return the next item from the IterableDataset:

In [None]:
next(iter(iterable_dataset))

for example in iterable_dataset:
    print(example)
    break

{'image': <PIL.Image.Image image mode=RGB size=384x512 at 0x7C1C228D26C0>, 'label': 6}


### Preprocess

Preprocessing is a critical step in working with data, and Hugging Face makes common preprocessing steps easier, such as:

- Tokenize a text dataset.
- Resample an audio dataset.
- Apply transforms to an image dataset.

The last preprocessing step is usually setting your dataset format to be compatible with your machine learning framework’s expected input format.


**Tokenize text**

Models cannot process raw text, so you’ll need to convert the text into numbers. Tokenization does this by splitting text into smaller units called tokens, which may be whole words or pieces of words. Each token is then mapped to a number.
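
To make the idea concrete, here is a toy sketch of the two steps: split the text, then look up each token's id. It is a deliberate simplification. The vocabulary and the `toy_tokenize` helper are hypothetical, and real tokenizers such as BERT's use subword algorithms and much larger vocabularies.

```python
# Toy vocabulary; real tokenizers ship with tens of thousands of entries.
vocab = {"[UNK]": 0, "the": 1, "movie": 2, "is": 3, "great": 4}

def toy_tokenize(text):
    """Lowercase, split on whitespace, and map each token to an id."""
    tokens = text.lower().split()
    # Unknown tokens fall back to the [UNK] id.
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids

tokens, ids = toy_tokenize("The movie is great")
print(tokens)
print(ids)
```

Because the ids only make sense relative to one vocabulary, a model has to be paired with the tokenizer it was trained with.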

1. Start by loading the rotten_tomatoes dataset and the tokenizer corresponding to a pretrained BERT model. Using the same tokenizer as the pretrained model is important because you want to make sure the text is split in the same way.


In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")


2. Call your tokenizer on the first row of text in the dataset:



In [None]:
tokenizer(dataset[0]["text"])

{'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

3. The fastest way to tokenize your entire dataset is to use the map() function. This function speeds up tokenization by applying the tokenizer to batches of examples instead of individual examples. Set the batched parameter to True:

In [None]:
def tokenization(example):
    return tokenizer(example["text"])

dataset = dataset.map(tokenization, batched=True)

Map:   0%|          | 0/8530 [00:00<?, ? examples/s]
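
With `batched=True`, the function passed to `map()` receives a dictionary whose values are lists (one entry per example in the batch) instead of a single example, and it should return lists of matching length. The sketch below imitates that contract in plain Python. `batched_map` and `count_words` are hypothetical helpers for illustration, not part of 🤗 Datasets.

```python
def batched_map(rows, function, batch_size=2):
    """Hypothetical stand-in for map(batched=True): call `function` once per batch."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict of columns, each holding a list of values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        # Attach the i-th value of every new column to the i-th row of the chunk.
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out

def count_words(batch):
    # Receives a list of texts and returns a list of the same length.
    return {"n_words": [len(text.split()) for text in batch["text"]]}

rows = [{"text": "a b c"}, {"text": "d e"}, {"text": "f"}]
print(batched_map(rows, count_words))
```

This is why `tokenization` above can pass `example["text"]` straight to the tokenizer: tokenizers accept a list of strings and process the whole list in one call.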

There are also guides for working with:

- Image data: [Link](https://huggingface.co/docs/datasets/main/en/image_load)
- Audio data: [Link](https://huggingface.co/docs/datasets/main/en/audio_load)



In [None]:
!gzip -dkv SQuAD_it-*.json.gz

SQuAD_it-test.json.gz:	 87.5% -- created SQuAD_it-test.json
SQuAD_it-train.json.gz:	 82.3% -- created SQuAD_it-train.json


In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
