# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]


In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Preprocessing with a Tokenizer

Transformer models, like other neural networks, cannot process raw text directly, so the first step of our pipeline converts the text inputs into numbers that the model can work with. This is the job of a tokenizer, which is responsible for:

1. Breaking the input into tokens (words, subwords, or symbols like punctuation)
2. Mapping each token to an integer
3. Adding supplementary inputs beneficial to the model
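
The three steps above can be sketched with a toy tokenizer in plain Python. This is only an illustration: `TOY_VOCAB` and `toy_encode` are hypothetical names, the IDs are copied from the tokenizer output shown later in this notebook, and pairing each ID with a token by position is an assumption. A real tokenizer uses a learned subword vocabulary rather than a simple regex split.

```python
import re

# Hypothetical toy vocabulary. The IDs are copied from the tokenizer output
# shown later in this notebook; pairing each ID with a token is an assumption.
TOY_VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}

def toy_encode(text):
    # Step 1: lowercase the text and split it into word and punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: map each token to its integer ID.
    ids = [TOY_VOCAB[token] for token in tokens]
    # Step 3: add the extra inputs the model expects (special tokens, mask).
    input_ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_encode("I hate this so much!")
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The toy version fails on any word outside its tiny vocabulary, which is one reason real tokenizers fall back to subword pieces.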

This preprocessing must be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Given the checkpoint name of our model, it automatically fetches the data associated with the model's tokenizer and caches it.


In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Tokenizing and Preparing Input

Once we have the tokenizer, we can pass our sentences to it directly and get back a dictionary that is almost ready to feed to our model. The only thing left to do is convert the lists of input IDs to tensors.

While 🤗 Transformers hides the underlying ML framework (PyTorch, TensorFlow, or Flax) from the user, Transformer models only accept tensors as input. You can think of a tensor as a NumPy array: it can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions. Tensors in other ML frameworks behave similarly and are usually just as simple to create.
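
As a small illustration of the scalar, vector, and matrix cases, the following sketch builds PyTorch tensors of increasing dimensionality and round-trips one through NumPy. The values and variable names are made up for the example.

```python
import torch

scalar = torch.tensor(3.14)                    # 0D: a single number
vector = torch.tensor([1.0, 2.0, 3.0])         # 1D: a list of numbers
matrix = torch.tensor([[1, 2, 3], [4, 5, 6]])  # 2D: rows and columns
cube = torch.zeros(2, 3, 4)                    # 3D: a stack of matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, t.ndim, tuple(t.shape))

# Converting to NumPy and back is a one-liner in each direction.
as_numpy = matrix.numpy()
back_to_torch = torch.from_numpy(as_numpy)
```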

To specify the type of tensors we wish to obtain (PyTorch, TensorFlow, or plain NumPy), we utilize the `return_tensors` argument:


In [4]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
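
The zeros in the second row come from padding. A minimal plain-Python sketch of the idea, assuming right-padding to the longest sequence in the batch with a padding ID of 0 (both match the output above), looks like this. `pad_batch` is a hypothetical helper, not a library function.

```python
# Unpadded token IDs, copied from the tokenizer output above.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]
PAD_ID = 0  # padding ID seen in the output above

def pad_batch(seqs, pad_id):
    # Pad every sequence on the right up to the longest one in the batch.
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # The mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch(sequences, PAD_ID)
print(input_ids[1])
print(attention_mask[1])
```

The attention mask tells the model which positions to ignore, so the padding does not influence the result for the shorter sentence.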


## Understanding the Base Transformer Module

The `AutoModel` class loads an architecture that contains only the base Transformer module. Given some inputs, it outputs what we call hidden states, also known as features. For each model input, we retrieve a high-dimensional vector that represents the Transformer model's contextual understanding of that input.


In [7]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

## Understanding High-Dimensional Vectors from the Transformer Module

The vector output by the Transformer module is usually large. It generally has three dimensions:

1. **Batch size:** The number of sequences processed at a time (2 in our example).
2. **Sequence length:** The length of the numerical representation of the sequence (16 in our example).
3. **Hidden size:** The vector dimension of each model input.

The "high dimensional" attribute arises primarily from the hidden size, which can be notably large. For instance, smaller models commonly have a hidden size of 768, while larger models might extend to 3072 or even higher.

We can see this by feeding our preprocessed inputs to the model. The outputs of 🤗 Transformers models behave like namedtuples or dictionaries, so their elements can be accessed in three ways:

1. **Attributes:** Access an element by its attribute name.
   - Example: `outputs.last_hidden_state`

2. **Keys:** Access an element by key, as in a dictionary.
   - Example: `outputs["last_hidden_state"]`

3. **Indices:** If you know exactly where the element is, access it by index.
   - Example: `outputs[0]`
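
The three access styles can be tried without loading any model. The sketch below builds a stand-in output object from zeros using `BaseModelOutput` from `transformers.modeling_outputs`; the variable names and the tensor shape are chosen for illustration only.

```python
import torch
from transformers.modeling_outputs import BaseModelOutput

# Stand-in output filled with zeros; no model is loaded or run here.
fake_outputs = BaseModelOutput(last_hidden_state=torch.zeros(2, 16, 768))

by_attribute = fake_outputs.last_hidden_state
by_key = fake_outputs["last_hidden_state"]
by_index = fake_outputs[0]

# All three styles return the same tensor.
print(by_attribute.shape)
print(torch.equal(by_attribute, by_key), torch.equal(by_attribute, by_index))
```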


In [8]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


## Understanding Model Heads: Transforming Hidden States

Model heads take the high-dimensional vector of hidden states as input and project it onto a different dimension. They usually consist of one or a few linear layers, and they operate directly on the output of the Transformer model.

Inside the base Transformer model itself, the embeddings layer converts each input ID of the tokenized input into a vector that represents the associated token. The subsequent layers then manipulate those vectors using the attention mechanism to produce the final representation of the sentences.
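
To make the projection concrete, here is a minimal sketch of a classification head in PyTorch. It assumes a hypothetical head made of a single linear layer that reads only the first token's vector; real heads may add layers, dropout, or pool the sequence differently. The hidden states are random stand-ins, so only the shapes are meaningful.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, num_labels = 768, 2
# Random stand-in for the base model's output: 2 sequences, 16 tokens each.
hidden_states = torch.randn(2, 16, hidden_size)

# Hypothetical head: one linear layer from hidden_size to num_labels.
head = nn.Linear(hidden_size, num_labels)

first_token = hidden_states[:, 0]  # one vector per sequence: (2, 768)
logits = head(first_token)         # one score per label: (2, 2)
print(first_token.shape, logits.shape)
```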

🤗 Transformers offers many architectures, each designed for a specific task. The names below are class-name suffixes (for example, `AutoModel` or `AutoModelForCausalLM`). A non-exhaustive list:

- `*Model` (retrieves the hidden states)
- `*ForCausalLM`
- `*ForMaskedLM`
- `*ForMultipleChoice`
- `*ForQuestionAnswering`
- `*ForSequenceClassification`
- `*ForTokenClassification`
- and others 🤗

For our example, we need a model with a sequence classification head (to classify the sentences as positive or negative). So we won't use the `AutoModel` class, but `AutoModelForSequenceClassification` instead.


In [9]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

If we look at the shape of our outputs, the dimensionality is much lower: the model head takes the high-dimensional vectors we saw before as input and outputs vectors containing two values, one per label. Since we have two sentences and two labels, the result from our model has shape 2 x 2.


In [10]:
print(outputs.logits.shape)

torch.Size([2, 2])


## Postprocessing Model Output

The values our model outputs are not probabilities but logits: the raw, unnormalized scores produced by the last layer of the model. To convert them into probabilities, they need to go through a SoftMax layer. All 🤗 Transformers models output logits because the training loss function generally fuses the last activation function (such as SoftMax) with the actual loss function (such as cross-entropy).
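
One way to see why a loss function might prefer raw logits: cross-entropy can be computed from them directly with the log-sum-exp trick, which stays finite even when a separate SoftMax step would overflow. The plain-Python sketch below uses made-up logits and a hypothetical helper name; it illustrates the idea and is not how any particular library implements it.

```python
import math

def cross_entropy_from_logits(logits, target):
    # log(sum(exp(logits))) computed stably by subtracting the max first.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # Cross-entropy for the correct class: -log(softmax(logits)[target]).
    return log_sum_exp - logits[target]

logits = [2.0, -1.0]  # made-up logits for one example with two classes
target = 0            # made-up index of the correct class

# Two-step version: SoftMax first, then the negative log-probability.
probs = [math.exp(x) / sum(math.exp(y) for y in logits) for x in logits]
two_step = -math.log(probs[target])
fused = cross_entropy_from_logits(logits, target)
print(two_step, fused)

# With extreme logits, math.exp(1000.0) would overflow in the two-step
# version, while the fused version still returns a finite loss.
extreme = cross_entropy_from_logits([1000.0, 0.0], 1)
print(extreme)
```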

In [11]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [12]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
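
The same arithmetic can be checked by hand. The following plain-Python sketch applies SoftMax to the logits printed above; the `softmax` helper is written for this example, and its results should agree with the PyTorch output up to rounding.

```python
import math

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above.
batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

batch_probs = [softmax(row) for row in batch_logits]
for probs in batch_probs:
    print([round(p, 4) for p in probs], "sum =", round(sum(probs), 4))
```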


Before SoftMax, the model's predictions (logits) were:

- First sentence: [-1.5607, 1.6123]
- Second sentence: [4.1692, -3.3464]

After applying SoftMax, the model predicted:

- First sentence: [0.0402, 0.9598]
- Second sentence: [0.9995, 0.0005]

These are now recognizable probability scores: each row sums to 1.

To get the label for each position, we can inspect the `id2label` attribute of the model config (detailed in the next section). We can then conclude that the model predicted:

- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005

We have now reproduced the three steps of the pipeline: preprocessing with a tokenizer, passing the inputs through the model, and postprocessing. Let's take a deeper look at each of these steps.


In [13]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
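
Putting the last step together: the pipeline output at the top of this notebook reports only the most likely label and its score for each sentence. A minimal sketch of that final mapping, using the probabilities and the `id2label` dictionary printed above, could look like the following. It illustrates the idea only and is not the pipeline's actual implementation.

```python
# Values copied from the outputs above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]

results = []
for probs in probabilities:
    # Index of the highest probability (argmax).
    best = max(range(len(probs)), key=lambda i: probs[i])
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

The labels and scores agree with the pipeline result shown at the start, up to the precision of the printed tensors.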