<a href="https://colab.research.google.com/github/embarced/notebooks/blob/master/deep/transformers-pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers Tasks

* More examples: https://huggingface.co/transformers/task_summary.html

In [1]:
import sys
IN_COLAB = 'google.colab' in sys.modules
IN_COLAB

False

In [2]:
if IN_COLAB:
    # https://huggingface.co/transformers/installation.html
    !pip install -q transformers

In [3]:
import tensorflow as tf
tf.__version__


'2.14.0'

In [4]:
import transformers
transformers.__version__

'4.35.0'

# Transformer Pipelines 

### Pipelines are made of:

- A tokenizer in charge of mapping raw text input to tokens.
- A model to make predictions from those inputs.
- Some (optional) post-processing to enhance the model's output.
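
The three stages can be pictured with a toy sketch. Everything here (`toy_tokenize`, `toy_model`, `toy_postprocess`, the word weights) is hypothetical and only mirrors the structure; it is not how the `transformers` library implements its pipelines.

```python
import math

# Hypothetical toy vocabulary: word -> sentiment weight (not taken from any real model).
WEIGHTS = {"love": 2.0, "great": 1.5, "hate": -2.0, "bad": -1.5}


def toy_tokenize(text):
    """Stage 1: map raw text to tokens (here: lowercase whitespace split)."""
    return text.lower().split()


def toy_model(tokens):
    """Stage 2: reduce the tokens to one raw score (above zero means positive sentiment)."""
    return sum(WEIGHTS.get(token, 0.0) for token in tokens)


def toy_postprocess(raw_score):
    """Stage 3: squash the raw score into a probability and attach a label."""
    p_positive = 1.0 / (1.0 + math.exp(-raw_score))
    if p_positive >= 0.5:
        return {"label": "POSITIVE", "score": p_positive}
    return {"label": "NEGATIVE", "score": 1.0 - p_positive}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline object does behind one call."""
    return toy_postprocess(toy_model(toy_tokenize(text)))


print(toy_pipeline("I love you"))
print(toy_pipeline("I hate you"))
```

The point of the design is that the caller passes raw text and gets a labelled result back, without handling tokens or raw scores directly.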

### Available Tasks (November 2023)
- `"audio-classification"`: will return a [`AudioClassificationPipeline`].
- `"automatic-speech-recognition"`: will return a [`AutomaticSpeechRecognitionPipeline`].
- `"conversational"`: will return a [`ConversationalPipeline`].
- `"depth-estimation"`: will return a [`DepthEstimationPipeline`].
- `"document-question-answering"`: will return a [`DocumentQuestionAnsweringPipeline`].
- `"feature-extraction"`: will return a [`FeatureExtractionPipeline`].
- `"fill-mask"`: will return a [`FillMaskPipeline`].
- `"image-classification"`: will return a [`ImageClassificationPipeline`].
- `"image-segmentation"`: will return a [`ImageSegmentationPipeline`].
- `"image-to-image"`: will return a [`ImageToImagePipeline`].
- `"image-to-text"`: will return a [`ImageToTextPipeline`].
- `"mask-generation"`: will return a [`MaskGenerationPipeline`].
- `"object-detection"`: will return a [`ObjectDetectionPipeline`].
- `"question-answering"`: will return a [`QuestionAnsweringPipeline`].
- `"summarization"`: will return a [`SummarizationPipeline`].
- `"table-question-answering"`: will return a [`TableQuestionAnsweringPipeline`].
- `"text2text-generation"`: will return a [`Text2TextGenerationPipeline`].
- `"text-classification"` (alias `"sentiment-analysis"` available): will return a [`TextClassificationPipeline`].
- `"text-generation"`: will return a [`TextGenerationPipeline`].
- `"text-to-audio"` (alias `"text-to-speech"` available): will return a [`TextToAudioPipeline`].
- `"token-classification"` (alias `"ner"` available): will return a [`TokenClassificationPipeline`].
- `"translation"`: will return a [`TranslationPipeline`].
- `"translation_xx_to_yy"`: will return a [`TranslationPipeline`].
- `"video-classification"`: will return a [`VideoClassificationPipeline`].
- `"visual-question-answering"`: will return a [`VisualQuestionAnsweringPipeline`].
- `"zero-shot-classification"`: will return a [`ZeroShotClassificationPipeline`].
- `"zero-shot-image-classification"`: will return a [`ZeroShotImageClassificationPipeline`].
- `"zero-shot-audio-classification"`: will return a [`ZeroShotAudioClassificationPipeline`].
- `"zero-shot-object-detection"`: will return a [`ZeroShotObjectDetectionPipeline`].


In [5]:
from transformers import pipeline

In [6]:
# shows all possible tasks
pipeline?

Signature:
pipeline(
    task: str = None,
    model: Union[str, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel'), NoneType] = None,
    config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None,
    tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, Forward

## Sentiment Analysis

The default model is fine-tuned on SST-2, a sentiment task from the GLUE benchmark: https://huggingface.co/datasets/glue


In [7]:
classifier = pipeline(task="sentiment-analysis")
classifier.model.name_or_path

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


'distilbert-base-uncased-finetuned-sst-2-english'
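
The warning above suggests naming the model and revision explicitly. A minimal sketch, assuming the `model` and `revision` keyword arguments of `pipeline` and reusing the model name and revision hash printed in the log; `PINNED` and `build_classifier` are names made up for this example.

```python
# Model name and revision are copied from the log message above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the pipeline with an explicit model and revision."""
    # Imported lazily so the configuration can be inspected without loading the library.
    from transformers import pipeline

    return pipeline(**PINNED)
```

Calling `build_classifier()` should then load the same weights on every run, independent of whatever default the library picks in a later release.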

In [8]:
classifier("I hate you")

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]

In [9]:
classifier("I love you")

[{'label': 'POSITIVE', 'score': 0.9998656511306763}]

In [10]:
# https://huggingface.co/models?pipeline_tag=text-classification&language=de&sort=trending
classifier = pipeline(task="sentiment-analysis", model="oliverguhr/german-sentiment-bert")


In [11]:
classifier("Ich liebe dich")  # "I love you"

[{'label': 'positive', 'score': 0.9846151471138}]

In [12]:
classifier("Ick liebe dir")  # Berlin dialect for "I love you"

[{'label': 'positive', 'score': 0.9546653032302856}]

In [13]:
classifier("Ich bin mit diesem Artikel sehr zufrieden")  # "I am very satisfied with this item"

[{'label': 'positive', 'score': 0.9964079260826111}]

In [14]:
classifier("Ich bin mit diesem Artikel nicht unzufrieden")  # double negation: "I am not dissatisfied with this item"

[{'label': 'positive', 'score': 0.9832236766815186}]

In [15]:
classifier("Ich bin mit diesem Artikel zufrieden (von wegen)")  # sarcasm: "I am satisfied with this item (yeah, right)"

[{'label': 'positive', 'score': 0.978872537612915}]

In [16]:
classifier("Ich mag dich nicht")  # "I do not like you"

[{'label': 'negative', 'score': 0.9946463704109192}]
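
The recorded predictions can be compared against the labels a human reader would probably assign. In this small sketch the predicted labels are copied from the cell outputs above, while the expected labels are this example's own assumption (in particular, reading "(von wegen)" as sarcasm and therefore as negative).

```python
# (text, predicted label from the outputs above, expected label assumed for this sketch)
RESULTS = [
    ("Ich liebe dich", "positive", "positive"),
    ("Ick liebe dir", "positive", "positive"),
    ("Ich bin mit diesem Artikel sehr zufrieden", "positive", "positive"),
    ("Ich bin mit diesem Artikel nicht unzufrieden", "positive", "positive"),
    ("Ich bin mit diesem Artikel zufrieden (von wegen)", "positive", "negative"),
    ("Ich mag dich nicht", "negative", "negative"),
]

# Collect the sentences where the model and the assumed label disagree.
mismatches = [text for text, predicted, expected in RESULTS if predicted != expected]
accuracy = 1 - len(mismatches) / len(RESULTS)

print(f"agreement on these six sentences: {accuracy:.2f}")
for text in mismatches:
    print("disagrees with assumed label:", text)
```

Under that assumption the model handles dialect and double negation but not the sarcastic aside, even though it reports a high score for it.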

## Extractive Question Answering

* model: https://huggingface.co/distilbert-base-cased-distilled-squad
* dataset: The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any span of tokens in the given text.
  * https://paperswithcode.com/dataset/squad 
  * https://huggingface.co/datasets/squad
  * https://rajpurkar.github.io/SQuAD-explorer/

In [11]:
question_answerer = pipeline("question-answering")
question_answerer.model.name_or_path

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


'distilbert-base-cased-distilled-squad'

In [15]:
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

In [16]:

result = question_answerer(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95


In [17]:

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
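
The `start` and `end` values in the results are character offsets into `context`, with `end` exclusive, so slicing the context recovers the answer text. A quick check using the offsets printed above:

```python
# Same context string as in the notebook cell above.
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

# (start, end) pairs copied from the two outputs above.
first_span = context[34:95]
second_span = context[147:160]

print(repr(first_span))
print(repr(second_span))
```

This is what "extractive" means here: the answer is always a substring of the context, never newly generated text.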


# A bit more low-level than a pre-built pipeline: are two sequences paraphrases of each other?

In [18]:
model_name = 'bert-base-cased-finetuned-mrpc'

## Tokenizer

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [20]:
tokens = tokenizer("Short")
tokens

{'input_ids': [101, 6373, 102], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}

In [21]:
# whole words can become a single token, CLS and SEP tokens are added
# CLS means classification, SEP means separation
for token in tokens.input_ids:
    print(f"{token} = {tokenizer.decode([token])}")

101 = [CLS]
6373 = Short
102 = [SEP]


In [22]:
# a pair of sequences is joined into one input, with a [SEP] token after each sequence
tokens = tokenizer("first", "second")
for token in tokens.input_ids:
    print(f"{token} = {tokenizer.decode([token])}")

101 = [CLS]
1148 = first
102 = [SEP]
1248 = second
102 = [SEP]
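
For a sentence pair, BERT-style inputs follow the layout `[CLS] A [SEP] B [SEP]`, and `token_type_ids` mark which segment each position belongs to. A toy sketch of that layout on word lists that are already split; `build_pair` is a hypothetical helper, not a `transformers` function.

```python
def build_pair(first_tokens, second_tokens):
    """Lay out two token lists as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]", *first_tokens, "[SEP]", *second_tokens, "[SEP]"]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the final [SEP].
    token_type_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)
    # No padding in this toy example, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair(["first"], ["second"])
print(tokens)
print(token_type_ids)
print(attention_mask)
```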


## Model

https://huggingface.co/bert-base-cased-finetuned-mrpc?library=true

### Trained on

The Microsoft Research Paraphrase Corpus (MRPC) consists of 5,801 sentence pairs collected from newswire articles. Human annotators labelled each pair as a paraphrase or not.

https://paperswithcode.com/dataset/mrpc



In [23]:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [24]:
paraphrase = tokenizer(
    "The company HuggingFace is based in New York City", 
    "HuggingFace's headquarters are situated in Manhattan", 
    return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
print(f"Probability of being a paraphrase: {paraphrase_results[1]*100:.1f}%")


Probability of being a paraphrase: 90.5%


In [25]:
paraphrase = tokenizer(
    "The company HuggingFace is based in New York City", 
    "Apples are especially bad for your health", 
    return_tensors="tf")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
print(f"Probability of being a paraphrase: {paraphrase_results[1]*100:.1f}%")


Probability of being a paraphrase: 6.0%
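
The model returns raw logits, and `tf.nn.softmax` turns them into probabilities that sum to 1. A dependency-free sketch of the same computation; the logit values are made up for illustration, and index 1 is taken to be the paraphrase class, as in the cells above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [not paraphrase, paraphrase].
example_logits = [-1.0, 1.25]
probs = softmax(example_logits)
print(f"Probability of being a paraphrase: {probs[1] * 100:.1f}%")
```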


# German 

Unfortunately, German models and datasets are rare.

Models
* Base: https://huggingface.co/bert-base-german-cased
* Squad: https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2

Datasets
* https://tblock.github.io/10kGNAD/


In [26]:
from transformers import TFAutoModelForMaskedLM

# model_name = "distilbert-base-cased"
# model_name = "bert-base-german-cased"
# model_name = "bert-base-german-dbmdz-cased"  # works with PyTorch only
model_name = "bert-base-multilingual-cased"


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForMaskedLM.from_pretrained(model_name)


All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


In [27]:
# sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
# sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht, {tokenizer.mask_token} zu tun was nicht gegen das Gesetz ist."
# English: "Germany is a great country. As a citizen, you have the right to do [MASK]."
sequence = f"Deutschland ist ein tolles Land. Als Bürger hast du das Recht, {tokenizer.mask_token} zu tun."
sequence

'Deutschland ist ein tolles Land. Als Bürger hast du das Recht, [MASK] zu tun.'

In [28]:
tokens = tokenizer.encode(sequence)
for token in tokens:
    print(f"{token} = {tokenizer.decode([token])}")

101 = [CLS]
13011 = Deutschland
10298 = ist
10290 = ein
81754 = toll
10171 = ##es
12001 = Land
119 = .
11966 = Als
47473 = Bürger
10393 = has
10123 = ##t
10168 = du
10242 = das
31041 = Recht
117 = ,
103 = [MASK]
10304 = zu
53100 = tun
119 = .
102 = [SEP]
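
Pieces that start with `##` continue the previous token, which is how `tolles` ends up as `toll` + `##es`. A minimal sketch of merging such pieces back into words; `merge_wordpieces` is a hypothetical helper written for this example.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens: a piece starting with '##' is glued to the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the '##' marker and append to the last word
        else:
            words.append(piece)
    return words


# Decoded pieces copied from the start of the output above.
pieces = ["Deutschland", "ist", "ein", "toll", "##es", "Land", ".", "Als", "Bürger", "has", "##t", "du"]
print(merge_wordpieces(pieces))
```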


In [29]:
input = tokenizer.encode(sequence, return_tensors="tf")
mask_token_index = tf.where(input == tokenizer.mask_token_id)[0, 1]
mask_token_index

<tf.Tensor: shape=(), dtype=int64, numpy=16>
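
`tf.where` returns the coordinates at which the condition holds, and `[0, 1]` picks the column (the token position) of the first match. The same lookup in plain Python, using the token ids printed by the previous cell:

```python
# Token ids copied from the tokenizer output above; 103 is the [MASK] id shown there.
input_ids = [
    101, 13011, 10298, 10290, 81754, 10171, 12001, 119, 11966, 47473, 10393,
    10123, 10168, 10242, 31041, 117, 103, 10304, 53100, 119, 102,
]
MASK_TOKEN_ID = 103

# Every position whose id equals the mask id; this sequence contains exactly one.
mask_positions = [pos for pos, token_id in enumerate(input_ids) if token_id == MASK_TOKEN_ID]
print(mask_positions)
```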

In [30]:
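token_logits = model(input)[0]
# sanity check at position 1 (the unmasked word "Deutschland"):
# the highest-scoring token there is the input token itself
token = token_logits.numpy()[0][1].argmax()
print(f"{token} = {tokenizer.decode([token])}")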

13011 = Deutschland


In [31]:
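mask_token_logits = token_logits[0, mask_token_index, :]
top_5 = tf.math.top_k(mask_token_logits, 5)
# softmax is applied to the five largest logits only, so these probabilities
# are relative to the top 5, not to the whole vocabulary
probas = tf.nn.softmax(top_5.values).numpy()
indices = top_5.indices.numpy()
probas, indices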


(array([0.8575855 , 0.041558  , 0.03632782, 0.03400318, 0.03052554],
       dtype=float32),
 array([13011, 38451, 54760, 27879, 23942], dtype=int32))

In [32]:
for token, proba in zip(indices, probas):
    print(f"{token} = {tokenizer.decode([token])}, proba {proba*100:.1f}%")

13011 = Deutschland, proba 85.8%
38451 = nichts, proba 4.2%
54760 = Freiheit, proba 3.6%
27879 = Deutsch, proba 3.4%
23942 = etwas, proba 3.1%
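
Because the softmax in the cell above is applied to the five largest logits only, the percentages describe each candidate's share among the top 5 rather than its probability over the full vocabulary. A small sketch with made-up logits shows the difference; `softmax` and `top_k` are helpers defined here, not library calls.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return (indices, values) of the k largest entries, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return order, [values[i] for i in order]


# Hypothetical logits for a tiny 6-word vocabulary.
logits = [2.0, 0.5, 1.0, -1.0, 0.0, 1.5]

indices, values = top_k(logits, 3)
share_within_top_k = softmax(values)       # normalised over the top 3 only
full_probs = softmax(logits)               # normalised over all 6 entries
prob_over_vocab = [full_probs[i] for i in indices]

for i, a, b in zip(indices, share_within_top_k, prob_over_vocab):
    print(f"index {i}: top-k share {a:.3f}, full-vocabulary probability {b:.3f}")
```

The ranking is the same either way; only the absolute numbers shrink when the rest of the vocabulary is included in the normalisation.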


In [33]:
for token in indices:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Deutschland ist ein tolles Land. Als Bürger hast du das Recht, Deutschland zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, nichts zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, Freiheit zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, Deutsch zu tun.
Deutschland ist ein tolles Land. Als Bürger hast du das Recht, etwas zu tun.
