<a href="https://colab.research.google.com/github/embarced/notebooks/blob/master/deep/transformers-pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: examples from basic pipeline tasks

Examples taken from https://huggingface.co/transformers/task_summary.html

In [1]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
tf.__version__

'2.7.0'

In [2]:
# when we are not training, we do not need a GPU
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [3]:
# https://huggingface.co/transformers/installation.html
!pip install -q transformers


In [4]:
import transformers
transformers.__version__

'4.15.0'

In [5]:
from transformers import pipeline

In [6]:
# shows all possible tasks
pipeline?

## Sentiment Analysis

The default model is fine-tuned on SST-2, one of the tasks in the GLUE benchmark: https://huggingface.co/datasets/glue


In [7]:
classifier = pipeline(task="sentiment-analysis")
classifier.model.name_or_path

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



'distilbert-base-uncased-finetuned-sst-2-english'

In [8]:
classifier("I hate you")

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]

In [9]:
classifier("I love you")

[{'label': 'POSITIVE', 'score': 0.9998656511306763}]

## Are two sequences paraphrases of each other?

The Microsoft Research Paraphrase Corpus (MRPC) consists of 5,801 sentence pairs collected from newswire articles. Human annotators labelled each pair as a paraphrase or not.

https://paperswithcode.com/dataset/mrpc

In [16]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "HuggingFace's headquarters are situated in Manhattan"
sequence_2 = "Apples are especially bad for your health"

# The tokenizer automatically adds the model-specific special tokens ([CLS] and [SEP]) to the sequence pair and computes the attention mask.
paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

Some layers from the model checkpoint at bert-base-cased-finetuned-mrpc were not used when initializing TFBertForSequenceClassification: ['dropout_183']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at bert-base-cased-finetuned-mrpc.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


not paraphrase: 10%
is paraphrase: 90%
not paraphrase: 94%
is paraphrase: 6%


In [17]:
paraphrase

{'input_ids': <tf.Tensor: shape=(1, 26), dtype=int32, numpy=
array([[  101,  1109,  1419, 20164, 10932,  2271,  7954,  1110,  1359,
         1107,  1203,  1365,  1392,   102, 20164, 10932,  2271,  7954,
          112,   188,  3834,  1132,  3629,  1107,  6545,   102]],
      dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 26), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 26), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1]], dtype=int32)>}

In [18]:
paraphrase.input_ids

<tf.Tensor: shape=(1, 26), dtype=int32, numpy=
array([[  101,  1109,  1419, 20164, 10932,  2271,  7954,  1110,  1359,
         1107,  1203,  1365,  1392,   102, 20164, 10932,  2271,  7954,
          112,   188,  3834,  1132,  3629,  1107,  6545,   102]],
      dtype=int32)>

In [19]:
token = paraphrase.input_ids[0][0]
token

<tf.Tensor: shape=(), dtype=int32, numpy=101>

In [20]:
tokenizer.decode([token])

'[CLS]'

In [21]:
tokenizer.decode([102])

'[SEP]'

In [22]:
tokenizer.decode(paraphrase.input_ids[0][1])

'The'
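
The decoded ids above suggest how an encoded sentence pair is laid out: `[CLS]` (101), the first sequence, `[SEP]` (102), the second sequence, and a closing `[SEP]`. As a rough sketch in plain Python, the `token_type_ids` shown earlier can be reproduced by switching from 0 to 1 after the first separator. The helper `segment_ids` is hypothetical and not part of the transformers API; the tokenizer computes these values itself.

```python
# Token ids copied from the `paraphrase` output above.
input_ids = [
    101, 1109, 1419, 20164, 10932, 2271, 7954, 1110, 1359,
    1107, 1203, 1365, 1392, 102, 20164, 10932, 2271, 7954,
    112, 188, 3834, 1132, 3629, 1107, 6545, 102,
]

SEP_ID = 102  # id of [SEP] in this vocabulary, as decoded above


def segment_ids(ids, sep_id=SEP_ID):
    """Hypothetical helper: 0 up to and including the first [SEP], 1 afterwards."""
    segments = []
    current = 0
    for token_id in ids:
        segments.append(current)
        if token_id == sep_id and current == 0:
            current = 1
    return segments


token_type_ids = segment_ids(input_ids)
print(token_type_ids)
```

The result has 14 zeros followed by 12 ones, which matches the `token_type_ids` tensor printed for `paraphrase`.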

In [23]:
paraphrase_classification_logits.numpy()

array([[-0.34945548,  1.9003881 ]], dtype=float32)

In [24]:
tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

array([0.09536295, 0.90463704], dtype=float32)
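
To make the step from logits to percentages explicit, here is a minimal softmax sketch in plain Python. It is only an illustration of what `tf.nn.softmax` computes for a single row, using the two logits printed above; the function name `softmax` is our own.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift by max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output of the previous cell.
logits = [-0.34945548, 1.9003881]
classes = ["not paraphrase", "is paraphrase"]

probs = softmax(logits)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.4f}")
```

The two values sum to 1 and agree with the TensorFlow result above up to float32 rounding.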

## Extractive Question Answering

https://huggingface.co/transformers/task_summary.html#extractive-question-answering

In [25]:
question_answerer = pipeline("question-answering")
question_answerer.model.name_or_path

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



'distilbert-base-cased-distilled-squad'

In [26]:
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
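
The `start` and `end` values are character offsets into `context`, so the answers can be recovered by slicing the string. A small check in plain Python, with the offsets copied from the printed results above (note that the raw string begins with a newline, which counts as character 0):

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

# (start, end) pairs copied from the two printed results above.
spans = [(34, 95), (147, 160)]

# Slicing with the offsets yields the extracted answers.
answers = [context[start:end] for start, end in spans]
for answer in answers:
    print(repr(answer))
```

This is what makes the task "extractive": the model does not generate text, it only points at a span of the given context.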


# Masked Language Modelling

https://huggingface.co/transformers/task_summary.html#masked-language-modeling

In [27]:
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
    f"versions would help {tokenizer.mask_token} our carbon footprint."

inputs = tokenizer(sequence, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))


Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertForMaskedLM: ['activation_13']
- This IS expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
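
The selection step in the cell above boils down to picking the positions of the largest logits and mapping them back to vocabulary entries. The following plain-Python sketch illustrates this idea only; the five-word vocabulary, the logit values, and the helper `top_k_indices` are hypothetical stand-ins for the real 28,996-entry vocabulary and `tf.math.top_k`.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical toy vocabulary and logits for the masked position.
vocab = ["reduce", "increase", "banana", "offset", "the"]
logits = [4.2, 3.1, -2.0, 2.5, 0.3]

template = "Using them would help [MASK] our carbon footprint."
best = top_k_indices(logits, 3)
filled = [template.replace("[MASK]", vocab[i]) for i in best]
for sentence in filled:
    print(sentence)
```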


## A more realistic example

Our focus on ________ may seem irrational to outsiders.

https://twitter.com/johncutlefish/status/1433640784457207848


In [28]:
sequence = f"Our focus on {tokenizer.mask_token} may seem irrational to outsiders."

inputs = tokenizer(sequence, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Our focus on religion may seem irrational to outsiders.
Our focus on spirituality may seem irrational to outsiders.
Our focus on morality may seem irrational to outsiders.
Our focus on nature may seem irrational to outsiders.
Our focus on politics may seem irrational to outsiders.


## More than one masked token (not as impressive, though)

You can have an amazing company culture but at a certain point, it becomes stale. People don't leave for new opportunities. It is amazing at _______, but not longer amazing at _______.

https://twitter.com/johncutlefish/status/1436725064943034374



In [30]:
sequence = "You can have an amazing company culture but at a certain point, it becomes stale. "  \
           f"People don't leave for new opportunities. It is amazing at {tokenizer.mask_token}, but not longer amazing at {tokenizer.mask_token}."

inputs = tokenizer(sequence, return_tensors="tf")
inputs

{'input_ids': <tf.Tensor: shape=(1, 42), dtype=int32, numpy=
array([[  101,  1192,  1169,  1138,  1126,  6929,  1419,  2754,  1133,
         1120,   170,  2218,  1553,   117,  1122,  3316,   188, 15903,
          119,  2563,  1274,   112,   189,  1817,  1111,  1207,  6305,
          119,  1135,  1110,  6929,  1120,   103,   117,  1133,  1136,
         2039,  6929,  1120,   103,   119,   102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 42), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
      dtype=int32)>}

In [31]:
# for each of the 42 tokens we get one logit (an unnormalized score) per vocabulary entry (28996)
token_logits = model(**inputs).logits
token_logits.shape

TensorShape([1, 42, 28996])

In [32]:
# positions of the masked tokens (id=103)
mask_token_indeces = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[:, 1]
mask_token_indeces

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([32, 39])>
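
For readers less familiar with `tf.where`, the same lookup can be sketched in plain Python: scan the token ids and collect every position that holds the mask id. The ids below are copied from the `inputs` output above; 103 is the mask id noted in the cell comment.

```python
# Token ids copied from the `inputs` output above.
input_ids = [
    101, 1192, 1169, 1138, 1126, 6929, 1419, 2754, 1133,
    1120, 170, 2218, 1553, 117, 1122, 3316, 188, 15903,
    119, 2563, 1274, 112, 189, 1817, 1111, 1207, 6305,
    119, 1135, 1110, 6929, 1120, 103, 117, 1133, 1136,
    2039, 6929, 1120, 103, 119, 102,
]

MASK_ID = 103  # id of the mask token in this vocabulary

# Collect the position of every masked token.
mask_positions = [i for i, token_id in enumerate(input_ids) if token_id == MASK_ID]
print(mask_positions)
```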

### First word

In [33]:
mask_token_logits = token_logits[0, mask_token_indeces[0], :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
  word = tokenizer.decode([token])
  print(word)

all
times
weddings
first
leisure


### Second word

In [34]:
mask_token_logits = token_logits[0, mask_token_indeces[1], :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
  word = tokenizer.decode([token])
  print(word)

all
leisure
weekends
times
weddings
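
Both lists come from a single forward pass, so each masked position is predicted without knowing which word fills the other one. Also note that `sequence.replace(tokenizer.mask_token, word)`, as used in the earlier cells, would put the same word into both gaps, because `str.replace` replaces every occurrence. A minimal sketch of filling the masks one at a time; the helper `fill_masks` and the literal `"[MASK]"` string are assumptions for illustration, and the two words are taken from the lists above.

```python
MASK = "[MASK]"  # assumed mask token string for this illustration


def fill_masks(text, words, mask=MASK):
    """Hypothetical helper: replace the masks left to right, one word per mask."""
    for word in words:
        text = text.replace(mask, word, 1)  # count=1 touches only the next mask
    return text


sentence = f"It is amazing at {MASK}, but not longer amazing at {MASK}."
result = fill_masks(sentence, ["times", "weekends"])
print(result)
```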


# German 

Models and datasets are rare, unfortunately

Models
* Base: https://huggingface.co/bert-base-german-cased
* Squad: https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2

Data Sets
* https://tblock.github.io/10kGNAD/


In [36]:
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

# model_name = "distilbert-base-cased"
model_name = "bert-base-german-cased"
# works with PyTorch only
# model_name = "bert-base-german-dbmdz-cased"
# model_name = "bert-base-multilingual-cased"


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForMaskedLM.from_pretrained(model_name)

# sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht, {tokenizer.mask_token} zu tun was nicht gegen das Gesetz ist."

input_ids = tokenizer.encode(sequence, return_tensors="tf")
mask_token_index = tf.where(input_ids == tokenizer.mask_token_id)[0, 1]

token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-german-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


Deutschland ist ein tolles Land. Als Bürger hast du das Recht, etwas zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, alles zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, nichts zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, Dinge zu tun was nicht gegen das Gesetz ist.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, das zu tun was nicht gegen das Gesetz ist.


In [37]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_name = "bert-base-german-dbmdz-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht, {tokenizer.mask_token} zu tun."

input_ids = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))


Some weights of the model checkpoint at bert-base-german-dbmdz-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Deutschland ist ein tolles Land. Als Bürger hast du das Recht, etwas zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, das zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, es zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, nichts zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, dies zu tun.


In [38]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_name = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht, {tokenizer.mask_token} zu tun."

input_ids = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Deutschland ist ein tolles Land. Als Bürger hast du das Recht, Deutschland zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, nichts zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, Freiheit zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, Deutsch zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, etwas zu tun.
