# Using Transformers Pipeline

* Ease of use
* Flexibility
* Simplicity


![image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline-dark.svg)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import torch
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from transformers import (AlbertForSequenceClassification, AutoModel,
                          AutoModelForSequenceClassification, AutoTokenizer,
                          BertForSequenceClassification,
                          DistilBertForSequenceClassification, RobertaModel,
                          pipeline)

from data_processing import build_text_data, load_tabular_data, split_dataset

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

Using device: cuda


## Processing local data

* convert the local tabular data into sentences that describe each job
* save the processed data to `./output.csv`, whose columns are the features and `label`
* convert the dataframe into a `Dataset` object for transformers
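
The first step can be pictured with a small sketch. `build_text_data` lives in the local `data_processing` module and is not shown in this notebook, so the helper below (`row_to_text`, a hypothetical name) is only an assumption about how a row might be flattened into `<feature> is <value>` phrases, based on the sentences printed further down.

```python
# Hypothetical stand-in for the row-to-sentence step; the real
# `build_text_data` may differ in its details.
def row_to_text(row):
    """Flatten a mapping of feature name -> value into one sentence."""
    # each feature contributes "<name> is <value> " (trailing space included)
    return "".join(f"{name} is {value} " for name, value in row.items())


# a shortened, illustrative row; missing values show up as the word "nan"
row = {"wms_delay": 1.0, "queue_delay": 4.0, "stage_in_delay": float("nan")}
sentence = row_to_text(row)
print(sentence)
```

Missing values end up as the literal word `nan` in the sentence, which the tokenizer later treats like any other word.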

In [3]:
''' preprocess data '''
tab_df = load_tabular_data()
print(tab_df.head())

text_df = build_text_data(df=tab_df)
print(text_df.head())

    wms_delay  queue_delay  runtime  post_script_delay  stage_in_delay  \
0         1.0          4.0      6.0                5.0             NaN   
76        6.0         20.0   2078.0                5.0            53.0   
75        6.0         20.0   1507.0                5.0            71.0   
74        6.0         20.0   5177.0                5.0            58.0   
73        6.0         20.0   5112.0                5.0            43.0   

    stage_out_delay  stage_in_bytes  stage_out_bytes  \
0               NaN             NaN              NaN   
76              5.0    1.014283e+09         388321.0   
75              4.0    1.014283e+09         357330.0   
74              5.0    1.014283e+09         338748.0   
73              5.0    1.014283e+09         466869.0   

    kickstart_executables_cpu_time  label  
0                              0.2      0  
76                           769.1      1  
75                           785.8      1  
74                           793.6      1 

### Using pipeline directly

* `pipeline` is the high-level API of transformers
* call `pipeline` with the name of the NLP task (e.g. `text-classification`)
  * by default, the text classification pipeline performs sentiment analysis,
  which provides a sentiment score (positive/negative) for each sentence.
* pass a sentence directly to the pipeline


In [4]:
# take a look at the first text, with label 0
print(text_df.loc[0,:])

text     wms_delay is 1.0 queue_delay is 4.0 runtime is...
label                                                    0
Name: 0, dtype: object


In [5]:
# specify the task as `text-classification` 
clf = pipeline("text-classification")
clf(text_df.loc[0,:]["text"])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9977535605430603}]

In [6]:
# specify the task as `zero-shot-classification`
zsc = pipeline("zero-shot-classification")
zsc(text_df.loc[0,:]["text"], ["normal", "abnormal"])

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'wms_delay is 1.0 queue_delay is 4.0 runtime is 6.0 post_script_delay is 5.0 stage_in_delay is nan stage_out_delay is nan stage_in_bytes is nan stage_out_bytes is nan kickstart_executables_cpu_time is 0.2 ',
 'labels': ['abnormal', 'normal'],
 'scores': [0.8415219783782959, 0.1584780067205429]}

__NOTE__:
* without a specified checkpoint (which weights to use), the pipeline falls back to the default checkpoint for the task.
* both the `text-classification` and `zero-shot-classification` pipelines failed on the first sentence.

Now, consider the `tokenizer` and `model` used inside the pipeline.

## Tokenizer

In [7]:
# ckp = "albert-base-v2"
# ckp = "roberta-base"
# ckp = "bert-base-uncased"
ckp = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(ckp)

In [8]:
# example of tokenizing a sentence
tokens = tokenizer.tokenize("wms_delay")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"tokens (subwords):     {tokens}")
print(f"ids (inputs to model): {ids}")

tokens (subwords):     ['w', '##ms', '_', 'delay']
ids (inputs to model): [1059, 5244, 1035, 8536]
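
To see where subwords such as `##ms` come from, here is a minimal sketch of greedy longest-match WordPiece splitting. It is a simplification, not the library's implementation: the vocabulary is a toy one holding only the four tokens and ids printed above plus `[UNK]`, and `basic_split` and `wordpiece` are illustrative names.

```python
import re

# Toy vocabulary: the ids are the ones printed above; the real vocabulary
# of this checkpoint has 30522 entries.
TOY_VOCAB = {"w": 1059, "##ms": 5244, "_": 1035, "delay": 8536, "[UNK]": 100}


def basic_split(text):
    """Lowercase, then split into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    return [piece for word in basic_split(text) for piece in wordpiece(word, vocab)]


tokens = tokenize("wms_delay")
ids = [TOY_VOCAB[t] for t in tokens]
print(tokens, ids)
```

Because `wms` is not in the vocabulary, the longest known prefix `w` is taken first and the remainder becomes the continuation piece `##ms`.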


In [9]:
# the `encode` method returns ids that include the special tokens
print(tokenizer.encode("delay is "))

[101, 8536, 2003, 102]


In [10]:
text_df.loc[:3,:]['text'].tolist()

['wms_delay is 1.0 queue_delay is 4.0 runtime is 6.0 post_script_delay is 5.0 stage_in_delay is nan stage_out_delay is nan stage_in_bytes is nan stage_out_bytes is nan kickstart_executables_cpu_time is 0.2 ',
 'wms_delay is 6.0 queue_delay is 20.0 runtime is 2078.0 post_script_delay is 5.0 stage_in_delay is 53.0 stage_out_delay is 5.0 stage_in_bytes is 1014283276.0 stage_out_bytes is 388321.0 kickstart_executables_cpu_time is 769.1 ',
 'wms_delay is 6.0 queue_delay is 20.0 runtime is 1507.0 post_script_delay is 5.0 stage_in_delay is 71.0 stage_out_delay is 4.0 stage_in_bytes is 1014283276.0 stage_out_bytes is 357330.0 kickstart_executables_cpu_time is 785.8 ',
 'wms_delay is 6.0 queue_delay is 20.0 runtime is 5177.0 post_script_delay is 5.0 stage_in_delay is 58.0 stage_out_delay is 5.0 stage_in_bytes is 1014283276.0 stage_out_bytes is 338748.0 kickstart_executables_cpu_time is 793.6 ']

In [11]:
inputs = tokenizer(text_df.loc[:3,:]["text"].tolist(), padding=True,truncation=True, return_tensors="pt")
inputs_labels = text_df.loc[:3,:]["label"].tolist()
print(inputs)

{'input_ids': tensor([[  101,  1059,  5244,  1035,  8536,  2003,  1015,  1012,  1014, 24240,
          1035,  8536,  2003,  1018,  1012,  1014,  2448,  7292,  2003,  1020,
          1012,  1014,  2695,  1035,  5896,  1035,  8536,  2003,  1019,  1012,
          1014,  2754,  1035,  1999,  1035,  8536,  2003, 16660,  2754,  1035,
          2041,  1035,  8536,  2003, 16660,  2754,  1035,  1999,  1035, 27507,
          2003, 16660,  2754,  1035,  2041,  1035, 27507,  2003, 16660, 14590,
          7559,  2102,  1035,  4654,  8586, 23056,  2015,  1035, 17368,  1035,
          2051,  2003,  1014,  1012,  1016,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0],
        [  101,  1059,  5244,  1035,  8536,  2003,  1020,  1012,  1014, 24240,
          1035,  8536,  2003,  2322,  1012,  1014,  2448,  7292,  2003, 19843,
          2620,  1012,  1014,  2695,  1035,  5896,  1035,  8536,  2003,  1019,


The output is a dictionary with two keys, `input_ids` and `attention_mask`. `input_ids` contains one row of integers per sentence (four rows here); each integer is the unique identifier of a token. `attention_mask` marks real tokens with 1 and padding positions with 0.

The `input_ids` include special ids: `101` (`[CLS]`) marks the beginning of the sentence and `102` (`[SEP]`) marks its end. The trailing `0`s are padding.
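
The structure of that dictionary can be reproduced by hand. The sketch below is a plain-Python imitation, not the tokenizer's code: it wraps each id list in the special ids, pads to the longest row, and builds the matching mask. `encode_batch` is an illustrative name.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special ids visible in the output above


def encode_batch(token_id_lists):
    """Add special tokens, pad to the longest row, and build the attention mask."""
    rows = [[CLS_ID] + ids + [SEP_ID] for ids in token_id_lists]
    width = max(len(row) for row in rows)
    input_ids = [row + [PAD_ID] * (width - len(row)) for row in rows]
    # 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# "delay is" and "wms_delay", using the ids printed earlier
batch = encode_batch([[8536, 2003], [1059, 5244, 1035, 8536]])
print(batch)
```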

### Going through the model

In [12]:
# start with an AutoModel: the base model (here DistilBertModel) without a task-specific head
model = AutoModel.from_pretrained(ckp)
print(model)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

In [13]:
# the base model cannot be used for classification directly: it only returns hidden states
model(**inputs)

BaseModelOutput(last_hidden_state=tensor([[[-0.5112,  0.3341, -0.2621,  ...,  0.0405, -0.3250, -0.0681],
         [-0.3345, -0.0419, -0.4377,  ...,  0.0296, -0.2275,  0.3376],
         [-0.7079,  0.6680, -0.3945,  ...,  0.1466, -0.4400, -0.2390],
         ...,
         [-0.6479,  0.1558, -0.4304,  ...,  0.3244, -0.3796, -0.0942],
         [-0.5215,  0.2106, -0.4026,  ...,  0.1038, -0.4682,  0.0129],
         [-0.3970,  0.1923, -0.3712,  ...,  0.1037, -0.5123, -0.0208]],

        [[-0.4192,  0.3030, -0.2822,  ...,  0.0500, -0.2020, -0.0663],
         [-0.3530, -0.0792, -0.3236,  ...,  0.0387, -0.0820,  0.3589],
         [-0.7125,  0.6483, -0.3507,  ...,  0.1025, -0.3564, -0.1653],
         ...,
         [-0.7792, -0.0933, -0.5075,  ...,  0.2453, -0.0520, -0.3853],
         [-0.9141,  0.2545, -0.4900,  ...,  0.1813,  0.0997,  0.1334],
         [-0.0699,  0.6108, -0.2630,  ..., -0.1825, -0.4212, -0.6291]],

        [[-0.4244,  0.3019, -0.2419,  ...,  0.0480, -0.1983, -0.0835],
         [-

### Model heads: Making sense out of numbers
![image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head-dark.svg)

Now, consider the model for the `text-classification` task.

In [14]:
model = DistilBertForSequenceClassification.from_pretrained(ckp, num_labels=2)
outputs = model(**inputs)
print("output: \n", outputs)
print("logits: \n", outputs.logits)
print("prob. : \n", torch.nn.functional.softmax(outputs.logits, dim=-1))
print("labels: \n", outputs.logits.argmax(dim=-1))
print("true_labels: \n", inputs_labels)

output: 
 SequenceClassifierOutput(loss=None, logits=tensor([[ 3.3347, -2.7614],
        [ 3.1755, -2.6341],
        [ 3.1681, -2.6305],
        [ 3.2287, -2.6752]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
logits: 
 tensor([[ 3.3347, -2.7614],
        [ 3.1755, -2.6341],
        [ 3.1681, -2.6305],
        [ 3.2287, -2.6752]], grad_fn=<AddmmBackward0>)
prob. : 
 tensor([[0.9978, 0.0022],
        [0.9970, 0.0030],
        [0.9970, 0.0030],
        [0.9973, 0.0027]], grad_fn=<SoftmaxBackward0>)
labels: 
 tensor([0, 0, 0, 0])
true_labels: 
 [0, 1, 1, 1]
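
The step from logits to probabilities and labels is small enough to check by hand. The sketch below applies a plain-Python softmax to the first row of logits printed above; it mirrors what `torch.nn.functional.softmax` and `argmax` compute, without using torch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.3347, -2.7614]       # first row of the logits printed above
probs = softmax(logits)
pred = probs.index(max(probs))   # argmax: index of the most likely class
print([round(p, 4) for p in probs], pred)
```

The probabilities round to the `0.9978` and `0.0022` shown in the output, and the predicted label is 0.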


In [15]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Handling multiple sentences


In [16]:
# first sentence, tokenized without the `[CLS]`/`[SEP]` special tokens
tokens = tokenizer.tokenize(text_df.loc[0, "text"])
print("Tokens", tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print("IDs   ",ids)
output = model(torch.tensor([ids]))
print("Logits", output.logits)
prob = torch.nn.functional.softmax(output.logits, dim=-1)
print("Prob. ", prob)


Tokens ['w', '##ms', '_', 'delay', 'is', '1', '.', '0', 'queue', '_', 'delay', 'is', '4', '.', '0', 'run', '##time', 'is', '6', '.', '0', 'post', '_', 'script', '_', 'delay', 'is', '5', '.', '0', 'stage', '_', 'in', '_', 'delay', 'is', 'nan', 'stage', '_', 'out', '_', 'delay', 'is', 'nan', 'stage', '_', 'in', '_', 'bytes', 'is', 'nan', 'stage', '_', 'out', '_', 'bytes', 'is', 'nan', 'kicks', '##tar', '##t', '_', 'ex', '##ec', '##utable', '##s', '_', 'cpu', '_', 'time', 'is', '0', '.', '2']
IDs    [1059, 5244, 1035, 8536, 2003, 1015, 1012, 1014, 24240, 1035, 8536, 2003, 1018, 1012, 1014, 2448, 7292, 2003, 1020, 1012, 1014, 2695, 1035, 5896, 1035, 8536, 2003, 1019, 1012, 1014, 2754, 1035, 1999, 1035, 8536, 2003, 16660, 2754, 1035, 2041, 1035, 8536, 2003, 16660, 2754, 1035, 1999, 1035, 27507, 2003, 16660, 2754, 1035, 2041, 1035, 27507, 2003, 16660, 14590, 7559, 2102, 1035, 4654, 8586, 23056, 2015, 1035, 17368, 1035, 2051, 2003, 1014, 1012, 1016]
Logits tensor([[ 3.7297, -3.0441]], grad_fn

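__NOTE__:
* the text is split into subword tokens, which are then mapped to integer ids
* logits are the raw model outputs
* probabilities, obtained by applying softmax to the logits, lie in [0, 1] and indicate the model's confidence in each class
* the ids here carry no `[CLS]`/`[SEP]` special tokens, which is why the logits differ from those of the batched call earlier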

In [17]:
'''
df = pd.read_csv("output.csv")
torch.cuda.empty_cache()
y_pred = []
for i in range(len(df)):
    # tokers = tokenizer([df['text'][i]], padding=True, truncation=True, return_tensors="pt").to(DEVICE)
    # outputs = model(**tokers)
    y_pred.append(int(clf(df['text'][i])[0]["label"].split("_")[1]))
y_true = df["label"].tolist()
# inputs = tokenizer(df["text"].tolist()[:1000], padding=True, truncation=True, return_tensors="pt").to(DEVICE)
# outputs = model(**inputs)
# outputs.logits.argmax(1)
classification_report(y_true, y_pred)
'''


### Wrapping up: from tokenizer to model

In [18]:
inputs = tokenizer(text_df['text'].tolist()[:100], padding=True, truncation=True, return_tensors="pt")
output = model(**inputs)
print("pred labels", output.logits.argmax(1))
print("true labels", text_df['label'].tolist()[:100])

pred labels tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0])
true labels [0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]


In [19]:
from sklearn.metrics import accuracy_score, classification_report


In [20]:
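# NOTE: scikit-learn expects classification_report(y_true, y_pred); the predictions are
# passed first here, so the report below treats them as the ground truth
rep = classification_report(output.logits.argmax(1).detach().cpu().numpy(), text_df['label'].tolist()[:100])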
print(rep)

              precision    recall  f1-score   support

           0       1.00      0.26      0.41       100
           1       0.00      0.00      0.00         0

    accuracy                           0.26       100
   macro avg       0.50      0.13      0.21       100
weighted avg       1.00      0.26      0.41       100
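
The report depends on the order of its two arguments: scikit-learn's signature is `classification_report(y_true, y_pred)`. The sketch below recomputes precision and recall for class 0 by hand in both orders. The label list is a stand-in with the class counts implied by the report (26 zeros and 74 ones among the first 100 rows), not the actual rows, and `precision_recall` is an illustrative helper.

```python
def precision_recall(y_true, y_pred, cls):
    """Precision and recall of one class, computed from scratch."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)  # rows predicted as `cls`
    actual = sum(t == cls for t in y_true)     # rows that really are `cls`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


y_true = [0] * 26 + [1] * 74   # class counts implied by the report (assumption)
y_pred = [0] * 100             # the model predicts class 0 for every row

correct = precision_recall(y_true, y_pred, cls=0)   # (y_true, y_pred) order
swapped = precision_recall(y_pred, y_true, cls=0)   # arguments exchanged
print("correct order:", correct)
print("swapped order:", swapped)
```

The swapped order gives precision 1.0 and recall 0.26 for class 0, as in the printed report; with the intended order the two values trade places.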



__NOTE__:
* passing the entire dataset through the tokenizer and model in a single call is inefficient and causes out-of-memory (`OOM`) errors on both CPU and GPU; process the data in batches instead.
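
One common way around the memory problem is to feed the data in small chunks. The sketch below shows only the chunking logic in plain Python; `batched` is an illustrative helper and the batch size of 4 is arbitrary.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"sentence {i}" for i in range(10)]  # placeholder sentences
batch_sizes = [len(chunk) for chunk in batched(texts, batch_size=4)]
print(batch_sizes)

# In the notebook, each chunk would be tokenized and passed to the model
# (ideally under `torch.no_grad()`), keeping only the argmax of the logits
# so that memory use stays bounded by one batch.
```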

## Supervised Fine-Tuning


In [1]:
from datasets import load_dataset

In [2]:
# TODO: replace with your own dataset
raw_datasets = load_dataset("csv", data_files={"train": "./data/train.csv", "validation": "./data/validation.csv", "test": "./data/test.csv"})
# raw_datasets


Found cached dataset csv (/tmp/jinh/huggingface/datasets/csv/default-0a7c04ab8c22fc34/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)


  0%|          | 0/3 [00:00<?, ?it/s]

In [3]:
# set up the tokenizer function
from transformers import AutoTokenizer, DataCollatorWithPadding

# re-create the tokenizer: in a fresh kernel it is undefined and `map` raises a NameError
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

def tokenizer_function(data):
    return tokenizer(data["text"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenizer_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/38469 [00:00<?, ? examples/s]
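
The tokenizer function above leaves padding to `DataCollatorWithPadding`, which pads each batch only to its own longest sequence. The sketch below estimates how many token slots that saves compared with padding everything to the global maximum. The sequence lengths are made up for illustration, and `padded_size` is a hypothetical helper, not part of transformers.

```python
def padded_size(lengths, batch_size, dynamic=True):
    """Total number of token slots after padding."""
    if not dynamic:
        # static padding: every sequence is padded to the global maximum
        return max(lengths) * len(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)  # pad only to the batch maximum
    return total


lengths = [76, 94, 94, 94, 12, 15, 11, 9]  # hypothetical token counts
static_slots = padded_size(lengths, batch_size=4, dynamic=False)
dynamic_slots = padded_size(lengths, batch_size=4, dynamic=True)
print(static_slots, dynamic_slots)
```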

In [66]:
# trainer API
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="sft", 
                                  save_strategy="epoch", overwrite_output_dir=True)
trainer = Trainer(model, training_args, 
                  train_dataset=tokenized_datasets["train"], 
                  eval_dataset=tokenized_datasets["validation"], 
                  data_collator=data_collator, 
                  tokenizer=tokenizer)

trainer.train()



Step,Training Loss
500,0.541
1000,0.5597
1500,0.5431
2000,0.543
2500,0.5302
3000,0.5501
3500,0.5498
4000,0.5315
4500,0.5439
5000,0.5479


TrainOutput(global_step=14427, training_loss=0.5403152088237468, metrics={'train_runtime': 650.842, 'train_samples_per_second': 177.32, 'train_steps_per_second': 22.167, 'total_flos': 298587208529160.0, 'train_loss': 0.5403152088237468, 'epoch': 3.0})
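
The `global_step` value can be cross-checked with a little arithmetic. The sketch assumes a single device and the `TrainingArguments` defaults for batch size (8 per device) and epochs (3), since neither is set in the cell above; the example count is the one shown by the `Map` progress bar.

```python
import math

num_train_examples = 38469  # from the `Map` progress bar
batch_size = 8              # TrainingArguments default, assumed unchanged
num_epochs = 3              # TrainingArguments default, matches 'epoch': 3.0

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the total matches the reported `global_step=14427`.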

In [67]:
# NOTE: evaluate on the validation set
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(4809, 2) (4809,)


In [68]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)


In [69]:
(preds==0).sum()

106

In [70]:
import evaluate

metric = evaluate.combine(["accuracy", "f1", "precision", "recall"])
metric.compute(predictions=preds, references=predictions.label_ids)


{'accuracy': 0.7602412143896861,
 'f1': 0.8602932267054405,
 'precision': 0.7548373378694451,
 'recall': 1.0}
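
These four numbers follow from a single confusion matrix. The counts in the sketch below are reconstructed from the reported values (4809 validation rows, 106 rows predicted as class 0, recall 1.0); they are an inference, not something the notebook printed, and `binary_metrics` is an illustrative helper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Reconstructed counts: tn + fn = 106 rows predicted as 0, fn = 0 because recall is 1.0
m = binary_metrics(tp=3550, fp=1153, tn=106, fn=0)
print(m)
```

The model predicts class 1 for almost every row, so recall is perfect while precision is simply the share of true positives among the rows predicted as 1.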

In [47]:
accuracy_score(predictions.label_ids, preds)

1.0

In [50]:
# putting everything together: define a `compute_metrics` function and pass it to the trainer
def compute_metrics(eval_preds):
    metric = evaluate.combine(["accuracy", "f1", "precision", "recall"])
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
    # return accuracy_score(labels, predictions)

In [51]:
# update trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()

Step,Training Loss
500,0.0
1000,0.0
1500,0.0134


TrainOutput(global_step=1803, training_loss=0.0037077239942009204, metrics={'train_runtime': 88.7796, 'train_samples_per_second': 162.47, 'train_steps_per_second': 20.309, 'total_flos': 351620221110624.0, 'train_loss': 0.0037077239942009204, 'epoch': 3.0})

In [52]:
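predictions = trainer.predict(tokenized_datasets["test"])
# `predictions` has three fields (predictions, label_ids, metrics), but `compute_metrics`
# unpacks only two, so pass the first two explicitly
compute_metrics((predictions.predictions, predictions.label_ids))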
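
`trainer.predict` returns a named tuple with three fields (`predictions`, `label_ids`, `metrics`), while `compute_metrics` unpacks exactly two values, so passing the result in directly raises `ValueError: too many values to unpack`. The sketch below reproduces this with a stand-in class; `FakePredictionOutput` and `unpack_two` are illustrative names, not part of transformers.

```python
from typing import Any, NamedTuple


class FakePredictionOutput(NamedTuple):
    """Stand-in with the same three fields as the object returned by `predict`."""
    predictions: Any
    label_ids: Any
    metrics: Any


def unpack_two(eval_preds):
    # same unpacking as in `compute_metrics`
    logits, labels = eval_preds
    return logits, labels


out = FakePredictionOutput(predictions=[[0.1, 0.9]], label_ids=[1], metrics={"test_loss": 0.0})

try:
    unpack_two(out)  # three fields cannot be unpacked into two names
    error = None
except ValueError as exc:
    error = str(exc)

# passing only the first two fields works
logits, labels = unpack_two((out.predictions, out.label_ids))
print(error, logits, labels)
```

When `compute_metrics` is registered on the `Trainer`, the computed values are also available in `predictions.metrics`.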

In [54]:
predictions.metrics


{'test_loss': 3.4706106877280263e-09,
 'test_accuracy': 1.0,
 'test_f1': 1.0,
 'test_precision': 1.0,
 'test_recall': 1.0,
 'test_runtime': 58.3066,
 'test_samples_per_second': 659.788,
 'test_steps_per_second': 82.478}

In [72]:
num_params = sum(p.numel() for p in model.parameters())
print("Number of parameters: {:,}".format(num_params))


Number of parameters: 66,955,010
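
The parameter count can be rebuilt from the layer shapes. The model printout above is truncated, so the second layer norm of each block (`output_layer_norm`) and the `pre_classifier` layer of the head are taken from the DistilBERT architecture as an assumption rather than from this notebook's output.

```python
def linear(n_in, n_out):
    """Parameters of a linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out


hidden, ffn_dim, vocab, max_pos, n_layers, n_labels = 768, 3072, 30522, 512, 6, 2
layer_norm = 2 * hidden  # scale and shift vectors

embeddings = vocab * hidden + max_pos * hidden + layer_norm
block = (
    4 * linear(hidden, hidden)   # q_lin, k_lin, v_lin, out_lin
    + layer_norm                 # sa_layer_norm
    + linear(hidden, ffn_dim)    # ffn.lin1
    + linear(ffn_dim, hidden)    # ffn.lin2
    + layer_norm                 # output_layer_norm (assumed, not in the printout)
)
head = linear(hidden, hidden) + linear(hidden, n_labels)  # pre_classifier + classifier

total = embeddings + n_layers * block + head
print(f"Number of parameters: {total:,}")
```

Almost two thirds of the parameters sit in the six transformer blocks; the word embeddings account for most of the rest.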
