## Transformers, what can they do?

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
!pip install datasets transformers[sentencepiece]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.34.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.4 which is incompatible.


The most basic object in the 🤗 Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:

In [1]:
from transformers import pipeline

In [2]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

We can even pass several sentences!

In [3]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [4]:
classifier("That movie was pretty bad but i really liked it.")

[{'label': 'POSITIVE', 'score': 0.9988355040550232}]

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.

There are three main steps involved when you pass some text to a pipeline:

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.

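The three steps above can be sketched with a toy example. This is only an illustration under stated assumptions: the vocabulary, the per-token scores, and the function names (`preprocess`, `model`, `postprocess`, `toy_pipeline`) are all hypothetical stand-ins, not what the real sentiment-analysis pipeline uses internally.

```python
import math

# Hypothetical toy vocabulary and per-token scores (negative, positive).
# A real pipeline uses a trained tokenizer and a neural network instead.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
TOKEN_LOGITS = {
    0: (0.0, 0.0),
    1: (0.0, 0.0),
    2: (-2.0, 2.0),
    3: (2.0, -2.0),
    4: (0.0, 0.0),
}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model can consume."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in words]


def model(input_ids):
    """Step 2: map token ids to one raw score (logit) per label."""
    negative = sum(TOKEN_LOGITS[i][0] for i in input_ids)
    positive = sum(TOKEN_LOGITS[i][1] for i in input_ids)
    return [negative, positive]


def postprocess(logits):
    """Step 3: convert logits to probabilities and pick the best label."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, as a pipeline does for you."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love this!"))  # label is 'POSITIVE'
```

The returned dictionary has the same `label` and `score` keys as the real output shown earlier, which is why the score can be read as a probability.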
Some of the currently available pipelines are:

- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

Let’s have a look at a few of these!

### Zero-shot classification
We’ll start by tackling a more challenging task where we need to classify texts that haven’t been labelled. <br>
This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. <br>
For this use case, the **zero-shot-classification** pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. <br>
You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

In [5]:
classifier2 = pipeline("zero-shot-classification")

No model was supplied, defaulted to roberta-large-mnli (https://huggingface.co/roberta-large-mnli)
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at roberta-large-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [6]:
classifier2(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "movies", "sports"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'movies', 'sports', 'politics'],
 'scores': [0.9088426232337952,
  0.056544046849012375,
  0.01865222677588463,
  0.01596103422343731]}

This pipeline is called **zero-shot** because you don’t need to fine-tune the model on your data to use it. <br>
It can directly return probability scores for any list of labels you want!
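One way to see how scores for arbitrary labels can be produced without fine-tuning is the natural language inference (NLI) approach, which the default `roberta-large-mnli` checkpoint suggests: each candidate label is turned into a hypothesis sentence, the model scores how strongly the text entails it, and the scores are normalized across labels. The sketch below assumes that approach; the hypothesis template, the function names, and the entailment logits are made up for illustration and no model is actually run.

```python
import math


def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each label into a sentence an NLI model could score (assumed template)."""
    return [template.format(label) for label in candidate_labels]


def scores_from_entailment(candidate_labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    highest = max(entailment_logits)
    exps = [math.exp(x - highest) for x in entailment_logits]
    total = sum(exps)
    ranked = sorted(
        zip(candidate_labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


labels = ["education", "politics", "movies", "sports"]
print(build_hypotheses(labels))

# Hypothetical entailment logits, one per label, in the same order as `labels`.
made_up_logits = [3.0, -1.0, 0.5, -0.5]
print(scores_from_entailment(labels, made_up_logits))
```

Because the scores are normalized over the candidate list, adding or removing a label changes every score, not just one.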

### Text generation
Now let’s see how to use a pipeline to generate some text. <br>
The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. <br>
This is similar to the predictive text feature that is found on many phones. <br> Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.
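To make the role of randomness concrete, here is a minimal sketch of sampling-based generation. It assumes a tiny hand-written table of next-word probabilities (`NEXT_WORD_PROBS`, hypothetical) in place of a real language model, and a hypothetical `generate` helper; the point is only that the next word is drawn at random, so different seeds can give different continuations while a fixed seed is repeatable.

```python
import random

# Hypothetical next-word probabilities standing in for a language model.
NEXT_WORD_PROBS = {
    "we": {"teach": 0.6, "learn": 0.4},
    "teach": {"you": 1.0},
    "learn": {"together": 1.0},
    "you": {"transformers": 0.5, "pipelines": 0.5},
}


def generate(prompt_word, max_length, seed):
    """Extend the prompt one sampled word at a time, up to max_length words."""
    rng = random.Random(seed)  # a private generator makes each call repeatable
    words = [prompt_word]
    while len(words) < max_length and words[-1] in NEXT_WORD_PROBS:
        choices = NEXT_WORD_PROBS[words[-1]]
        next_word = rng.choices(list(choices), weights=list(choices.values()))[0]
        words.append(next_word)
    return " ".join(words)


for seed in range(3):
    print(seed, generate("we", max_length=4, seed=seed))
```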

In [7]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [8]:
generator("In this course, we will teach you how to")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'In this course, we will teach you how to apply and build a solid foundation for your job. You will work out how much you like to eat, what you like to relax about, what you want to do to improve, and how you might'}]

You can control how many different sequences are generated with the argument **num_return_sequences** and the total length of the output text with the argument **max_length**.

### Using any model from the Hub in a pipeline
The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation. <br>
Go to the [Model Hub](https://huggingface.co/models) and click on the corresponding tag on the left to display only the supported models for that task. <br>
You should get to a page like [this one](https://huggingface.co/models?pipeline_tag=text-generation).

Let’s try the [distilgpt2](https://huggingface.co/distilgpt2) model! Here’s how to load it in the same pipeline as before:

In [9]:
generator2 = pipeline("text-generation", model="distilgpt2")
generator2(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'In this course, we will teach you how to create an account, a login system and you will learn how to use the different features of the application'},
 {'generated_text': 'In this course, we will teach you how to program your skills in computer science and how to apply programming in computer science to the business of education.'}]

### Mask filling
The next pipeline you’ll try is **fill-mask**. <br>
The idea of this task is to fill in the blanks in a given text:

In [10]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models", top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at distilroberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


[{'score': 0.1963142454624176,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models'},
 {'score': 0.04449177905917168,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models'}]

The **top_k** argument controls how many candidates are displayed. <br> Note that here the model fills in the special **\<mask\>** word, which is often referred to as a mask token. <br>
Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models. <br>
One way to check it is by looking at the mask word used in the widget.
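The effect of **top_k** and of the mask word can be sketched without a model. In the example below, `fill_mask` and `CANDIDATES` are hypothetical: the first two scores are rounded from the output above and the third is invented, so this only illustrates the selection and substitution logic, not how the pipeline computes its scores.

```python
import heapq

# Candidate words with scores; the third entry is invented for illustration.
CANDIDATES = {"mathematical": 0.196, "building": 0.044, "language": 0.010}


def fill_mask(text, candidate_scores, mask_token="<mask>", top_k=2):
    """Return the top_k highest-scoring candidates substituted for the mask token."""
    if mask_token not in text:
        # Different models use different mask tokens, so fail loudly on a mismatch.
        raise ValueError(f"No {mask_token!r} found in the input text")
    best = heapq.nlargest(top_k, candidate_scores.items(), key=lambda item: item[1])
    return [
        {
            "score": score,
            "token_str": token,
            "sequence": text.replace(mask_token, token, 1),
        }
        for token, score in best
    ]


for row in fill_mask("This course will teach you all about <mask> models", CANDIDATES):
    print(row)
```

Passing a text written for another model's mask word (for example `[MASK]`) raises an error here, which mirrors why it is worth checking the mask token before switching models.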

### Named entity recognition
**Named entity recognition (NER)** is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. <br>
Let’s look at an example:

In [30]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
Some layers from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing TFBertForTokenClassification: ['dropout_147']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForTokenClassification were initialized from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english.
If your task is similar to the task the

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

Here the model correctly identified that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).

We pass the option **grouped_entities=True** when creating the pipeline to tell it to regroup the parts of the sentence that correspond to the same entity: here the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words. <br>
In fact, as we will see in the next chapter, the preprocessing even splits some words into smaller parts. <br>
For instance, Sylvain is split into four pieces: S, ##yl, ##va, and ##in. 
In the post-processing step, the pipeline successfully regrouped those pieces.
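The two regrouping ideas described above can be sketched in plain Python. This is a simplified illustration, not the pipeline's actual post-processing: `merge_wordpieces` and `group_entities` are hypothetical helpers, and the per-word labels are written by hand instead of being predicted by a model.

```python
def merge_wordpieces(pieces):
    """Glue pieces that start with '##' back onto the preceding piece."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


def group_entities(tagged_words):
    """Merge consecutive words that carry the same entity label ('O' means no entity)."""
    groups = []
    previous_label = "O"
    for word, label in tagged_words:
        if label != "O":
            if label == previous_label:
                groups[-1]["word"] += " " + word
            else:
                groups.append({"entity_group": label, "word": word})
        previous_label = label
    return groups


print(merge_wordpieces(["S", "##yl", "##va", "##in"]))  # ['Sylvain']

# Hand-written labels for the example sentence (assumed, not model output).
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("Sylvain", "PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Hugging", "ORG"), ("Face", "ORG"), ("in", "O"), ("Brooklyn", "LOC"),
]
print(group_entities(tagged))
```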

### Question answering
The **question-answering** pipeline answers questions using information from a given context:

In [12]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


Some layers from the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing TFDistilBertForQuestionAnswering: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased-distilled-squad and are newly initialized: ['dropout_262']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
question_answerer(question="Where do I live?",
                  context="My name is Batuhan and I live in Turkey.")

{'score': 0.980381429195404, 'start': 33, 'end': 39, 'answer': 'Turkey'}

Note that this pipeline works by extracting information from the provided context; it does not generate the answer.
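Because the answer is extracted, the `start` and `end` values in the result are character offsets into the context, and slicing the context with them reproduces the answer. The short check below reuses the result shown above; `extract_answer` is a hypothetical helper added only for illustration.

```python
def extract_answer(context, start, end):
    """Return the span of the context between two character offsets."""
    return context[start:end]


context = "My name is Batuhan and I live in Turkey."
# Result copied from the pipeline output above.
result = {"score": 0.980381429195404, "start": 33, "end": 39, "answer": "Turkey"}

extracted = extract_answer(context, result["start"], result["end"])
print(extracted)  # Turkey
```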

### Summarization
**Summarization** is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. <br>
Here’s an example:

In [18]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to t5-small (https://huggingface.co/t5-small)


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


In [21]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'summary_text': 'the number of graduates in traditional engineering disciplines has declined . in most of the premier american universities engineering curricula now concentrate on and encourage largely the study of engineering science . rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [20]:
summarizer(
"""
After getting a green card in exchange for assassinating a Cuban government official, Tony Montana (Al Pacino) stakes a claim on the drug trade in Miami. 
Viciously murdering anyone who stands in his way, Tony eventually becomes the biggest drug lord in the state, controlling nearly all the cocaine that comes through Miami. 
But increased pressure from the police, wars with Colombian drug cartels and his own drug-fueled paranoia serve to fuel the flames of his eventual downfall.""")

Your max_length is set to 200, but you input_length is only 116. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=58)


[{'summary_text': 'Tony Montana is the biggest drug lord in the state, controlling nearly all the cocaine that comes through Miami . but increased pressure from the police, wars with Colombian drug cartels and his own drug-fueled paranoia fuel the flames .'}]

As with text generation, you can specify a **max_length** or a **min_length** for the result.
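The warning in the second example suggests a simple rule of thumb for short inputs: when the input is shorter than **max_length**, use about half the input length instead. The helper below, `suggest_max_length`, is a hypothetical sketch of that heuristic (it is not a library function) and reproduces the value from the warning.

```python
def suggest_max_length(input_length, max_length):
    """Halve the input length when the input is shorter than max_length (heuristic)."""
    if input_length < max_length:
        return input_length // 2
    return max_length


# Values taken from the warning above: input_length=116, max_length=200.
print(suggest_max_length(116, 200))  # 58
```

The resulting value could then be passed as `max_length` when calling the summarizer, as the warning itself proposes.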

The pipelines shown so far are mostly for demonstrative purposes. <br>
They were programmed for specific tasks and cannot perform variations of them. <br>In the next chapter, you’ll learn what’s inside a **pipeline()** function and how to customize its behavior.