# Getting started on a task with a pipeline

The easiest way to use a pretrained model on a given task is to use `pipeline`. 🤗 Transformers
provides the following tasks out of the box:

- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place,
  etc.)
- Question answering: provide the model with some context and a question, extract the answer from the context.
- Filling masked text: given a text with masked words (e.g., replaced by `[MASK]`), fill the blanks.
- Summarization: generate a summary of a long text.
- Translation: translate a text into another language.
- Feature extraction: return a tensor representation of the text.
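
As a rough orientation, each task in the list above corresponds to a string identifier that is passed as the first argument to `pipeline`. The mapping below is a sketch based on commonly used identifiers. The exact set depends on the installed version of the library, the language pair in the translation identifier is only one example, and the names `TASK_IDENTIFIERS` and `task_identifier` are purely illustrative.

```python
# Task descriptions from the list above mapped to pipeline identifiers.
# Assumption: identifiers as commonly accepted by `transformers.pipeline`;
# check the documentation of your installed version.
TASK_IDENTIFIERS = {
    "sentiment analysis": "sentiment-analysis",
    "text generation": "text-generation",
    "named entity recognition": "ner",
    "question answering": "question-answering",
    "filling masked text": "fill-mask",
    "summarization": "summarization",
    "translation": "translation_en_to_fr",  # one example language pair
    "feature extraction": "feature-extraction",
}


def task_identifier(description: str) -> str:
    """Return the pipeline identifier for a task description (case-insensitive)."""
    return TASK_IDENTIFIERS[description.strip().lower()]


print(task_identifier("Sentiment analysis"))  # sentiment-analysis
```

Passing the returned string to `pipeline` selects the corresponding task.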

Let's see how this works for sentiment analysis (the other tasks are all covered in the [task summary](https://huggingface.co/transformers/task_summary.html)):

In [1]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')



When you run this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will
look at both later on, but as an introduction, the tokenizer's job is to preprocess the text for the model, which is
then responsible for making predictions. The pipeline groups all of that together and post-processes the predictions to
make them readable. For instance:

In [2]:
classifier('We are happy to show your the 🤗 Transformer library.')

[{'label': 'POSITIVE', 'score': 0.9997890591621399}]

In [5]:
classifier('The pizza is not that the great but the crust is awesome.')

[{'label': 'POSITIVE', 'score': 0.9998530745506287}]

That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a
*batch*, returning a list of dictionaries like this one:

In [6]:
results = classifier(["We are happy to show your the 🤗 Transformer library.",
                    "We hope you don't hete it."])
for result in results:
    print(f"label:{result['label']}, with score:{round(result['score'],4)}")

label:POSITIVE, with score:0.9998
label:NEGATIVE, with score:0.9923


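You can see the second sentence has been classified as negative (the model has to pick either positive or negative).
Its score here is 0.9923, so the model is confident rather than neutral; note that the input contains the typo "hete".
For the correctly spelled sentence "We hope you don't hate it.", which is used in the rest of this tutorial, the
probabilities are much closer to neutral (about 0.53 for the negative class).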

By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can
look at its [model page](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) to get more
information about it. It uses the [DistilBERT architecture](https://huggingface.co/transformers/model_doc/distilbert.html) and has been fine-tuned on a
dataset called SST-2 for the sentiment analysis task.

Let's say we want to use another model; for instance, one that has been trained on French data. We can search through
the [model hub](https://huggingface.co/models) that gathers models pretrained on a lot of data by research labs, but
also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags
"French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's
see how we can use it.

You can directly pass the name of the model to use to `pipeline`:

In [7]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [8]:
classifier("Esperamos que no lo odie.")

[{'label': '3 stars', 'score': 0.33688199520111084}]
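
Unlike the default model, this one reports a star rating as its label. If a numeric rating is more convenient downstream, the label can be parsed, as in the sketch below. It assumes labels of the form `"1 star"` to `"5 stars"`, matching the output shown above; the helper name `stars_to_rating` is hypothetical.

```python
import re


def stars_to_rating(label: str) -> int:
    """Convert a label such as '3 stars' or '1 star' to an integer rating."""
    match = re.fullmatch(r"([1-5]) stars?", label.strip())
    if match is None:
        raise ValueError(f"unexpected label: {label!r}")
    return int(match.group(1))


# Prediction copied from the output above.
prediction = {"label": "3 stars", "score": 0.33688199520111084}
print(stars_to_rating(prediction["label"]), round(prediction["score"], 4))  # 3 0.3369
```

A score of roughly 0.34 across five classes means the model is far from certain about this rating.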

This classifier can now deal with texts in English and French, but also Dutch, German, Italian and Spanish! You can also
replace that name with a local folder where you have saved a pretrained model (see below). You can also pass a model
object and its associated tokenizer.

We will need two classes for this. The first is `AutoTokenizer`, which we will use to download the
tokenizer associated with the model we picked and instantiate it. The second is
`AutoModelForSequenceClassification` (or
`TFAutoModelForSequenceClassification` if you are using TensorFlow), which we will use to download
the model itself. Note that if we were using the library on another task, the class of the model would change. The
[task summary](https://huggingface.co/transformers/task_summary.html) tutorial summarizes which class is used for which task.

In [9]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

In [10]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [12]:
classifier("Enhanced, guideline-directed care. Lenus Health helps improve clinical outcomes through more efficient referral and diagnostic workflows, care coordination across settings and specialties, and clinically actionable AI insights. Patient-centred, data-driven care. Lenus Health delivers on strategic programmes including digital home care and early.")

[{'label': '5 stars', 'score': 0.5781165361404419}]
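
The local-folder option mentioned earlier can be sketched as follows, assuming the `save_pretrained` and `from_pretrained` methods that models and tokenizers in the library expose. The function name `save_and_reload` and its parameters are illustrative, and calling it downloads the checkpoint, so the block only defines the helper.

```python
def save_and_reload(model_name: str, local_dir: str, from_pt: bool = False):
    """Download a checkpoint, save it to `local_dir`, and rebuild a pipeline from disk."""
    # Imported lazily so that defining this helper does not load TensorFlow.
    from transformers import (
        AutoTokenizer,
        TFAutoModelForSequenceClassification,
        pipeline,
    )

    # Set from_pt=True for checkpoints that only provide PyTorch weights.
    model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=from_pt)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Write the weights, configuration, and tokenizer files into one folder.
    model.save_pretrained(local_dir)
    tokenizer.save_pretrained(local_dir)

    # The folder path now takes the place of the model name.
    local_model = TFAutoModelForSequenceClassification.from_pretrained(local_dir)
    local_tokenizer = AutoTokenizer.from_pretrained(local_dir)
    return pipeline("sentiment-analysis", model=local_model, tokenizer=local_tokenizer)
```

For the model used above, a call could look like `save_and_reload("nlptown/bert-base-multilingual-uncased-sentiment", "./local_model", from_pt=True)`, where `./local_model` is a placeholder path.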

In [13]:
## Under the hood

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized:

# Using the tokenizer

We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text into
words (or parts of words, punctuation symbols, etc.), usually called *tokens*. There are multiple rules that can govern
that process (you can learn more about them in the [tokenizer summary](https://huggingface.co/transformers/tokenizer_summary.html)), which is why we need
to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was
pretrained.

The second step is to convert those *tokens* into numbers, to be able to build a tensor out of them and feed them to
the model. To do this, the tokenizer has a *vocab*, which is the part we download when we instantiate it with the
`from_pretrained` method, since we need to use the same *vocab* as when the model was pretrained.
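
The two steps can be illustrated with a toy word-level tokenizer. This is only a sketch: the real tokenizer also splits words into sub-word pieces, which is not modelled here. The token-to-id pairs are inferred by aligning the example sentence with the `input_ids` printed later in this section, and the names `VOCAB`, `tokenize` and `encode` are hypothetical.

```python
import re

# Toy vocabulary with ids inferred from the `input_ids` printed in this
# section. A real vocabulary has tens of thousands of entries.
VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "very": 2200, "happy": 3407, "to": 2000,
    "show": 2265, "you": 2017, "the": 1996, "transformers": 19081,
    "library": 3075, ".": 1012,
}


def tokenize(text: str) -> list[str]:
    """Step 1: lowercase, then split into words and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode(text: str) -> list[int]:
    """Step 2: map tokens to ids and wrap the sequence in [CLS] ... [SEP]."""
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokenize(text)]
    return [VOCAB["[CLS]"], *ids, VOCAB["[SEP]"]]


sentence = "We are very happy to show you the 🤗 Transformers library."
print(tokenize(sentence))
print(encode(sentence))
```

Tokens that are missing from the vocabulary, such as the emoji, fall back to the `[UNK]` id.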

To apply these steps to a given text, we can just feed it to our tokenizer:

In [14]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

In [15]:
inputs1 = tokenizer("Enhanced, guideline-directed care. Lenus Health helps improve clinical outcomes through more efficient referral and diagnostic workflows, care coordination across settings and specialties, and clinically actionable AI insights. Patient-centred, data-driven care. Lenus Health delivers on strategic programmes including digital home care and early.")


In [16]:
print(inputs)

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [17]:
print(inputs1)

{'input_ids': [101, 9412, 1010, 5009, 4179, 1011, 2856, 2729, 1012, 18798, 2271, 2740, 7126, 5335, 6612, 13105, 2083, 2062, 8114, 6523, 7941, 1998, 16474, 2147, 12314, 2015, 1010, 2729, 12016, 2408, 10906, 1998, 2569, 7368, 1010, 1998, 6612, 2135, 2895, 3085, 9932, 20062, 1012, 5776, 1011, 16441, 1010, 2951, 1011, 5533, 2729, 1012, 18798, 2271, 2740, 18058, 2006, 6143, 8497, 2164, 3617, 2188, 2729, 1998, 2220, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [18]:
tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

In [19]:
for key, value in tf_batch.items():
    print(f"{key}: {value.numpy().tolist()}")

input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
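
The effect of `padding=True` in the output above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming right-padding with id `0` as seen in the printed batch. Truncation is not modelled, and `pad_batch` is a hypothetical helper, not part of the library.

```python
def pad_batch(sequences: list[list[int]], pad_id: int = 0) -> dict[str, list[list[int]]]:
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids of the two sentences, taken from the output above.
batch = pad_batch([
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
])
for key, value in batch.items():
    print(f"{key}: {value}")
```

The shorter sentence receives three padding ids, and its attention mask ends in three zeros, matching the printed batch.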


# Using the model

Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. If you're using a TensorFlow model, you can pass the dictionary
returned by the tokenizer directly to the model; for a PyTorch model, you need to unpack the dictionary by adding `**`.

In [20]:
tf_outputs = tf_model(tf_batch)

In [21]:
print(tf_outputs)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.083297  ,  4.3364143 ],
       [ 0.08181619, -0.04179142]], dtype=float32)>, hidden_states=None, attentions=None)
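
The `**` unpacking mentioned above for PyTorch models is ordinary Python syntax: it turns each dictionary key into a keyword argument. The sketch below uses a stand-in function, `toy_model`, which is hypothetical and only counts non-padding tokens; it is not a real model.

```python
def toy_model(input_ids, attention_mask=None):
    """Stand-in for a forward pass: counts the non-padding tokens per sequence."""
    if attention_mask is None:
        attention_mask = [[1] * len(ids) for ids in input_ids]
    return [sum(mask) for mask in attention_mask]


batch = {
    "input_ids": [[101, 2057, 102], [101, 102, 0]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
}

# `**batch` expands to toy_model(input_ids=..., attention_mask=...).
print(toy_model(**batch))  # [3, 2]
```

With a PyTorch model the call would take the same shape, for example `pt_model(**pt_batch)`, where both names are placeholders.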


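The model can return more than just the final activations, which is why the output is a structured object (here a
`TFSequenceClassifierOutput`) that can also be indexed like a tuple. Here we only asked for the final activations, so
`logits` is the only field that is set and `tf_outputs[0]` returns it.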

> **NOTE:** All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model **before** the final activation
> function (like SoftMax) since this final activation function is often fused with the loss.
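
One reason for fusing the final activation with the loss is numerical stability. The sketch below compares a cross-entropy computed directly from logits (using the log-sum-exp trick) with a two-step version that first forms softmax probabilities. It is a plain-Python illustration under that assumption, not the implementation used by the library, and both function names are hypothetical.

```python
import math


def cross_entropy_from_logits(logits, label):
    """Cross-entropy computed directly from logits via the log-sum-exp trick."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


def cross_entropy_two_step(logits, label):
    """Same loss, via explicit softmax probabilities followed by a log."""
    exps = [math.exp(x) for x in logits]
    return -math.log(exps[label] / sum(exps))


logits = [-4.083297, 4.3364143]  # first row of the logits printed above
print(cross_entropy_from_logits(logits, 1))
print(cross_entropy_two_step(logits, 1))

# With extreme logits the two-step version raises OverflowError,
# while the fused version still returns a finite value.
print(cross_entropy_from_logits([1000.0, 0.0], 0))  # 0.0
```

For moderate logits both functions agree; only the fused form remains usable when the logits are very large.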

In [22]:
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

In [23]:
print(tf_predictions)

tf.Tensor(
[[2.2042973e-04 9.9977952e-01]
 [5.3086263e-01 4.6913740e-01]], shape=(2, 2), dtype=float32)
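
These probabilities are what the pipeline turns into its readable `label` and `score` output. A minimal sketch of that post-processing step follows, assuming index 0 means NEGATIVE and index 1 means POSITIVE for this checkpoint. The names `ID2LABEL`, `softmax` and `postprocess` are illustrative and not the library's internals.

```python
import math

# Assumption: index 0 is NEGATIVE and index 1 is POSITIVE for this checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(batch_logits):
    """Keep the most probable class per input, as a label/score dictionary."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": ID2LABEL[best], "score": probs[best]})
    return results


# Logits copied from the model output printed above.
logits = [[-4.083297, 4.3364143], [0.08181619, -0.04179142]]
for result in postprocess(logits):
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
# label: POSITIVE, with score: 0.9998
# label: NEGATIVE, with score: 0.5309
```

The second sentence is labelled NEGATIVE with a score close to 0.5, which is what a nearly neutral prediction looks like in this format.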


In [26]:
# import tensorflow as tf
# tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))