# Breaking down pipeline()

Hugging Face (HF) provides a pipeline() function that tokenizes the input, passes it through the model, and post-processes the output, all in one step. We want to break this process down to better understand transformers.

In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9598050713539124},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

# Step by Step

## Step 1. Tokenizer 

In [None]:
from transformers import AutoTokenizer

# This command retrieves the tokenizer used during training for the model used in pipeline()
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Turning the prompt/input into tokens
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
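
The second sentence is shorter than the first, so `padding=True` fills it with zeros and the attention mask marks which positions are real tokens. A minimal sketch of that idea in plain Python, assuming right-padding with pad id 0 as seen in the output above. `pad_batch` is a hypothetical helper written for illustration, not part of transformers:

```python
# Token ids copied from the tokenizer output above, with the padding zeros
# stripped from the second row.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask.

    Hypothetical helper for illustration only, not a transformers API.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
print(input_ids[1])
print(attention_mask[1])
```

Both rows end up with length 16, which is what lets the tokenizer stack them into a single tensor.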


## Step 2. Define the model we want to use for our task

In [None]:
from transformers import AutoModel
from transformers import AutoModelForSequenceClassification

To retrieve the model's hidden states rather than class scores, we can use AutoModel, which returns a tensor of shape (batch size, sequence length, hidden size). Here that is 2 sentences, 16 tokens per sentence after padding, and 768 hidden dimensions.

In [None]:
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


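For classification, we only need the logits for the classes, i.e. logits = W*h + b

In this case W is a 2x768 matrix and h is a 768x1 vector representing one sentence, so each sentence gets a 2x1 vector of logits (one per class). With two sentences in the batch, the output is a 2x2 logits matrix.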

In [None]:
# We can specify the classification task in HF using the following
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)

torch.Size([2, 2])
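
To see where the 2x2 shape comes from, here is a dependency-free sketch of logits = W*h + b applied to a batch. All numbers are random stand-ins, and `linear_head` is a hypothetical helper, not the library's code. It assumes h is the hidden state at the first token position of each sentence. As far as we know, the real DistilBERT classification head also applies an extra dense layer and activation before the final projection, which this sketch leaves out:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

batch_size, seq_len, hidden_size, num_labels = 2, 16, 768, 2

# Stand-in for outputs.last_hidden_state: random values, same shape (2, 16, 768).
last_hidden_state = [
    [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(seq_len)]
    for _ in range(batch_size)
]

# Hypothetical head parameters: W is num_labels x hidden_size, b has num_labels entries.
W = [[random.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
b = [0.0] * num_labels


def linear_head(h):
    """Compute W*h + b for one 768-dimensional sentence vector h."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


# One vector per sentence: the hidden state at the first token position.
sentence_vectors = [tokens[0] for tokens in last_hidden_state]
logits = [linear_head(h) for h in sentence_vectors]

print(len(logits), len(logits[0]))  # 2 2
```

Each sentence contributes one row of two logits, so the batch dimension (2) and the number of labels (2) give the 2x2 result.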


## Step 3. Convert Raw Logits into Probabilities

The logits from step 2 are unnormalized scores, so we apply softmax to convert them into probabilities that we can interpret.

In [9]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
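
For intuition, softmax exponentiates each logit and divides by the sum, so every row becomes non-negative and sums to 1. A minimal sketch in plain Python. The logits below are hypothetical values chosen for illustration, not the model's actual outputs:

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence and two classes.
example_logits = [-1.5, 1.5]
probs = softmax(example_logits)
print(probs)
```

`dim=-1` in the torch call plays the same role as applying this function to each row separately.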


The model outputs are now probabilities. Let's retrieve the labels that the model uses for this classification.

In [10]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

So the first sentence is about 4% negative and 96% positive, and the second sentence is about 99.9% negative and close to 0% positive. These match the scores that pipeline() returned at the top.
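
The last post-processing step is to pick the most likely class per sentence and attach its label. A small sketch using the probabilities and label mapping printed above, copied as rounded in the output. This is our reading of what pipeline() does for this task, not its actual implementation:

```python
# Probabilities copied from the softmax output above (rounded as printed).
predictions = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

results = []
for probs in predictions:
    # argmax over the class dimension
    best = max(range(len(probs)), key=lambda i: probs[i])
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

The result has the same structure as the pipeline() output: one dictionary per sentence with a label and a score.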