In [1]:
from transformers import pipeline

In [2]:
sentiment_classifier = pipeline("sentiment-analysis")
sentiment_classifier

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


<transformers.pipelines.text_classification.TextClassificationPipeline at 0x10817dc40>

In [3]:
sentiment_classifier(
    inputs = [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Preprocessing with a tokenizer

Like other neural networks, Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model

All this preprocessing needs to be done in exactly the same way as when the model was pretrained.
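The three steps above can be sketched with a toy tokenizer. This is only an illustration, not the WordPiece algorithm the real tokenizer uses. `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names. The vocabulary is a tiny hand-written subset whose IDs are copied from the tokenizer output shown later in this notebook, and the splitting rule (lowercase, then words and punctuation) is an assumption that only approximates real subword splitting.

```python
import re

# Hypothetical toy vocabulary: a hand-picked subset, with IDs copied from the
# tokenizer output printed later in this notebook.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}

def toy_tokenize(text):
    """Step 1: lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    """Steps 2 and 3: add the special tokens, then map every token to an integer."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]

tokens = toy_tokenize("I hate this so much!")
ids = toy_encode("I hate this so much!")
print(tokens)  # ['i', 'hate', 'this', 'so', 'much', '!']
print(ids)     # [101, 1045, 5223, 2023, 2061, 2172, 999, 102]
```

A different vocabulary or splitting rule would produce different IDs for the same sentence, which is why the preprocessing has to match what the model saw during pretraining.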

To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the `checkpoint` name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it.

Since the default checkpoint of the `sentiment-analysis` pipeline is `distilbert/distilbert-base-uncased-finetuned-sst-2-english`, we run the following:

In [4]:
from transformers import AutoTokenizer

In [29]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [6]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Once we have the tokenizer, we can pass our sentences to it directly and get back a dictionary that’s ready to feed to our model. The only thing left to do is to convert the lists of input IDs to tensors.

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument.

Here’s what the results look like as PyTorch tensors:

In [17]:
from pprint import pprint, PrettyPrinter

In [7]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

In [8]:
token_results = tokenizer(
    raw_inputs, padding=True, truncation=True,
    return_tensors="pt"
)

In [21]:
print(token_results)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. `input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. `attention_mask` has the same shape and marks real tokens with 1 and padding positions with 0.

In [24]:
token_results['input_ids']

tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])

In [25]:
token_results['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

In [34]:
# Number of sequences
len(token_results['input_ids'])

2

In [37]:
# Number of tokens per sequence
len(token_results['input_ids'][0]), len(token_results['input_ids'][1])

(16, 16)
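Both rows have 16 entries because `padding=True` pads the shorter sentence to the length of the longest one in the batch, using the `[PAD]` ID (0). A minimal pure-Python sketch of that behavior follows, assuming right-side padding as reported by the tokenizer above. `pad_batch` is a hypothetical helper, not part of 🤗 Transformers, and the ID lists are copied from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded IDs of the two example sentences, taken from the tokenizer output above
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
         17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
lengths = [len(row) for row in batch["input_ids"]]
print(lengths)                     # [16, 16]
print(batch["attention_mask"][1])  # eight 1s followed by eight 0s
```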

## Going through the model

We can download our pretrained model the same way we did with our tokenizer. Transformers provides an `AutoModel` class which also has a `from_pretrained()` method:

In [27]:
from transformers import AutoModel

In [30]:
model = AutoModel.from_pretrained(checkpoint)

In [32]:
# model

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call `hidden states`, also known as `features`. 

For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

While these hidden states can be useful on their own, they’re usually inputs to another part of the model, known as the `head`. 

**A high-dimensional vector?**

The vector output by the Transformer module is usually large. It generally has three dimensions:

- **Batch size**: The number of sequences processed at a time (2 in our example).

- **Sequence length**: The length of the numerical representation of the sequence (16 in our example).

- **Hidden size**: The vector dimension of each model input.

It is said to be high dimensional because of the last value: the hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

We can see this if we feed the inputs we preprocessed to our model:

In [38]:
output = model(**token_results)

In [39]:
output

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

The outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attribute (`output.last_hidden_state`), by key (`output["last_hidden_state"]`), or even by index if you know exactly where the thing you are looking for is (`output[0]`).

In [43]:
# to see the dimension of the output
output.last_hidden_state.shape

torch.Size([2, 16, 768])

In [45]:
output['last_hidden_state'].shape

torch.Size([2, 16, 768])

In [47]:
output[0].shape

torch.Size([2, 16, 768])

## Model heads: Making sense out of numbers

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers. The output of the Transformer model is sent directly to the model head to be processed.
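The projection a head performs can be sketched with a single linear layer. This is an illustration under stated assumptions, not the head shipped with the checkpoint: the hidden states are random stand-ins with the shape seen above, `head` is a hypothetical one-layer classifier, and taking the vector at the first position (the `[CLS]` token) as the sentence summary is an assumption modeled on BERT-style classifiers. A real head may add more layers, an activation, and dropout.

```python
import torch
from torch import nn

torch.manual_seed(0)

batch_size, seq_len, hidden_size, num_labels = 2, 16, 768, 2

# Random stand-in for `last_hidden_state`, with the shape printed earlier
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# Hypothetical head: one linear layer projecting 768 features onto 2 labels
head = nn.Linear(hidden_size, num_labels)

# Keep one vector per sentence (the first position), then project it
cls_vectors = hidden_states[:, 0]  # shape: (2, 768)
logits = head(cls_vectors)         # shape: (2, 2)

print(cls_vectors.shape)
print(logits.shape)
```

The head turns one 768-dimensional vector per sentence into one score per label, which is why a classification model returns a much smaller tensor than the base model.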

The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.

There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:

```
    *Model (retrieve the hidden states)
    *ForCausalLM
    *ForMaskedLM
    *ForMultipleChoice
    *ForQuestionAnswering
    *ForSequenceClassification
    *ForTokenClassification
```

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the `AutoModel` class, but `AutoModelForSequenceClassification`.

In [48]:
from transformers import AutoModelForSequenceClassification

In [50]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [60]:
token_results

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

In [56]:
output = model(**token_results)

In [57]:
print(output)

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [58]:
output['logits']

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

In [59]:
output['logits'].shape

torch.Size([2, 2])

Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

## Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves. 

In [61]:
print(output.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


Our model predicted `[-1.5607, 1.6123]` for the first sentence and `[ 4.1692, -3.3464]` for the second one. 

Those are not probabilities but `logits`, the raw, unnormalized scores output by the last layer of the model.

To be converted to probabilities, they need to go through a SoftMax layer. All 🤗 Transformers models output the logits, because the loss function used for training generally fuses the last activation function (such as SoftMax) with the actual loss function (such as cross entropy).
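What SoftMax computes can be sketched by hand: exponentiate each score, then divide by the sum of the row so the values are positive and add up to 1. The pure-Python version below is only an illustration (the `softmax` helper is written for this note, with the logits copied from the output above); in practice the library function is the better choice.

```python
import math

def softmax(scores):
    """Exponentiate and normalize so the values are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

# Logits copied from the model output above, one row per sentence
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
# [0.0402, 0.9598]
# [0.9995, 0.0005]
```

Each row is normalized independently, so every sentence gets its own probability distribution over the labels.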

In [62]:
import torch.nn.functional as F

In [63]:
predictions = F.softmax(output.logits, dim=-1)
predictions

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)

In [67]:
# predictions_0 = F.softmax(output.logits, dim=0)
# predictions_0

In [68]:
# predictions_1 = F.softmax(output.logits, dim=1)
# predictions_1

Now we can see that the model predicted `[0.0402, 0.9598]` for the first sentence and `[0.9995, 0.0005]` for the second one. These are recognizable probability scores.

To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config:

In [71]:
# model.config

In [70]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

```
First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
```
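Reducing each row to its single most likely label gives the same kind of result the pipeline returned at the start of this notebook. A minimal sketch, assuming the rounded probabilities and the `id2label` mapping shown above; `top_labels` is a hypothetical helper name:

```python
# Mapping and rounded probabilities copied from the outputs above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]

def top_labels(probabilities, id2label):
    """Return the highest-scoring label and its probability for each row."""
    results = []
    for probs in probabilities:
        # Index of the largest probability in this row
        best = max(range(len(probs)), key=lambda i: probs[i])
        results.append({"label": id2label[best], "score": probs[best]})
    return results

predicted = top_labels(probabilities, id2label)
print(predicted)
# [{'label': 'POSITIVE', 'score': 0.9598}, {'label': 'NEGATIVE', 'score': 0.9995}]
```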

### Practice

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F

In [2]:
inputs = [
    "This is an absolute joke!",
    "Cristiano Ronaldo is better than Messi.",
    "Life is not about only success, failure plays a vital role too."
]

In [3]:
# Select checkpoint
checkpoint = 'ProsusAI/finbert'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [78]:
# model.config

In [4]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [82]:
# tokenizer

In [5]:
token_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors="pt")
token_inputs

{'input_ids': tensor([[  101,  2023,  2003,  2019,  7619,  8257,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101, 13675,  2923, 15668,  8923,  2080,  2003,  2488,  2084,  6752,
          2072,  1012,   102,     0,     0,     0],
        [  101,  2166,  2003,  2025,  2055,  2069,  3112,  1010,  4945,  3248,
          1037,  8995,  2535,  2205,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [97]:
pprint(token_inputs['input_ids'])

tensor([[  101,  2023,  2003,  2019,  7619,  8257,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101, 13675,  2923, 15668,  8923,  2080,  2003,  2488,  2084,  6752,
          2072,  1012,   102,     0,     0,     0],
        [  101,  2166,  2003,  2025,  2055,  2069,  3112,  1010,  4945,  3248,
          1037,  8995,  2535,  2205,  1012,   102]])


In [98]:
pprint(token_inputs['token_type_ids'])

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])


In [99]:
pprint(token_inputs['attention_mask'])

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


In [100]:
len(token_inputs['input_ids'])

3

In [101]:
len(token_inputs['input_ids'][0]), len(token_inputs['input_ids'][1]), len(token_inputs['input_ids'][2])

(16, 16, 16)

In [6]:
# outputs from the model
model_outputs = model(**token_inputs)

In [7]:
model_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.6557,  2.0578,  0.6569],
        [ 0.9340, -2.0800,  1.0718],
        [-1.1098, -0.2361,  1.9260]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [104]:
model_outputs.logits

tensor([[-1.6557,  2.0578,  0.6569],
        [ 0.9340, -2.0800,  1.0718],
        [-1.1098, -0.2361,  1.9260]], grad_fn=<AddmmBackward0>)

In [8]:
logits = model_outputs.logits
logits

tensor([[-1.6557,  2.0578,  0.6569],
        [ 0.9340, -2.0800,  1.0718],
        [-1.1098, -0.2361,  1.9260]], grad_fn=<AddmmBackward0>)

In [11]:
output_probs = F.softmax(logits, dim=-1)
output_probs

tensor([[0.0192, 0.7869, 0.1939],
        [0.4552, 0.0223, 0.5225],
        [0.0413, 0.0989, 0.8598]], grad_fn=<SoftmaxBackward0>)

In [108]:
model.config.id2label

{0: 'positive', 1: 'negative', 2: 'neutral'}

In [170]:
model.config.label2id

{'positive': 0, 'negative': 1, 'neutral': 2}

In [123]:
import torch

In [171]:
def result_output(model, output_probs):
    """Return the most likely label (upper-cased) and its probability for each input."""
    results = []

    for probs in output_probs:
        # torch.max along the last dimension returns both the value and its index
        score, ix = torch.max(probs, dim=-1)
        # id2label maps the class index back to its name, e.g. 1 -> 'negative'
        label = model.config.id2label[ix.item()]
        results.append({'label': label.upper(), 'score': score.item()})

    return results

In [172]:
final_output = result_output(model, output_probs)
final_output

[{'label': 'NEGATIVE', 'score': 0.7869254350662231},
 {'label': 'NEUTRAL', 'score': 0.522458553314209},
 {'label': 'NEUTRAL', 'score': 0.8597527742385864}]

In [18]:
from utils.result_finbert import result_output

In [19]:
final = result_output(model, output_probs)

In [20]:
final

[{'label': 'NEGATIVE', 'score': 0.7869254350662231},
 {'label': 'NEUTRAL', 'score': 0.522458553314209},
 {'label': 'NEUTRAL', 'score': 0.8597527742385864}]