# Working with Hugging Face Transformers in TensorFlow




## Working with pipelines
The most basic object in the 🤗 Transformers library is the pipeline. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

In [1]:
from transformers import pipeline

In [8]:
# Pipeline covers preprocessing->model->post-processing
classifier = pipeline("sentiment-analysis")  # Downloads and caches the default model, which is English-only

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are 

In [12]:
# Pass multiple sentences to the classifier.
classifier(
    ["I love Hamburg. But it's always rainy.",
    "The movie was great and the actors were bad.",
    "No man ever steps in the same river twice, for it's not the same river not the same",
    "I do not at all agree with the results of this useless classifier."]
    )

[{'label': 'NEGATIVE', 'score': 0.990067720413208},
 {'label': 'NEGATIVE', 'score': 0.99828040599823},
 {'label': 'NEGATIVE', 'score': 0.9944549202919006},
 {'label': 'NEGATIVE', 'score': 0.9997929930686951}]
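
The postprocessing stage that produces label/score dictionaries like the ones above can be sketched in plain Python. This is a minimal illustration, not the library's actual code: it assumes the model emits one logit per class, that a softmax turns them into probabilities, and that index 0 maps to NEGATIVE and index 1 to POSITIVE. The logits below are made up for the example.

```python
import math

# Assumed label order for the default sentiment checkpoint.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Return the most probable label in the pipeline's output format."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Hypothetical logits for two sentences; a real model would produce these.
example_logits = [[2.5, -2.1], [-1.0, 3.0]]
results = [postprocess(row) for row in example_logits]
print(results)
```

Because only the winning class is reported, a score near 0.99 says the model is confident in that label, not that the text is strongly negative or positive in any absolute sense.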

### What other pipelines do we have?

- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

## Zero-shot classification

Next, we tackle a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels, but it can also classify the text using any other set of labels you like.

In [13]:
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to roberta-large-mnli (https://huggingface.co/roberta-large-mnli)
Downloading: 100%|██████████| 688/688 [00:00<00:00, 369kB/s]
Downloading: 100%|██████████| 1.33G/1.33G [02:09<00:00, 11.0MB/s]
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at roberta-large-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
Downloading: 100%|██████████| 878k/878k [00:00<00:00, 1.16MB/s]
Downloading: 100%|██████████| 446k/446k [00:01<00:00, 334kB/s]
Downloading: 100%|██████████| 1.29M/1.29M [00:02<00:00, 655kB/s]


In [19]:
# For each input, this returns the candidate labels with their probabilities, sorted in decreasing order.

classifier(
    ["This is a course about the Transformers library",
    "The ampel coalition is planning to build the new government in december."],
    candidate_labels=["education", "politics", "business"]
    )

[{'sequence': 'This is a course about the Transformers library',
  'labels': ['education', 'business', 'politics'],
  'scores': [0.956234335899353, 0.026972245424985886, 0.016793372109532356]},
 {'sequence': 'The ampel coalition is planning to build the new government in december.',
  'labels': ['politics', 'business', 'education'],
  'scores': [0.9327414631843567, 0.04263611510396004, 0.024622347205877304]}]
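
To see where the sorted labels and scores come from, here is a rough sketch of the idea behind zero-shot classification with an NLI model such as the one loaded above. It assumes each candidate label is inserted into a hypothesis template (the template string used here is an assumption), that the model scores how strongly the input entails each hypothesis, and that a softmax over those scores gives the final probabilities. The entailment logits are invented for illustration.

```python
import math


def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in candidate_labels]


def rank_labels(candidate_labels, entailment_logits):
    """Softmax the per-label entailment logits and sort by decreasing score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = sorted(
        zip(candidate_labels, (e / total for e in exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


labels = ["education", "politics", "business"]
hypotheses = build_hypotheses(labels)
# Hypothetical entailment logits, one per hypothesis; an NLI model would supply these.
ranked = rank_labels(labels, [3.2, -0.8, -0.3])
print(hypotheses)
print(ranked)
```

Since the scores are normalized across the candidate labels, they always sum to 1, even when none of the labels fits the text well.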

## Text generation

Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.
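
The randomness mentioned above comes from sampling: at each step the model assigns a probability to every possible next token, and one token is drawn from that distribution. The toy sketch below uses a made-up five-token distribution and Python's `random` module (not the library's own sampling code) to show why two runs can differ and why fixing a seed makes them repeatable.

```python
import random

# Hypothetical next-token distribution after some prompt; a real model
# computes one of these over its whole vocabulary at every step.
NEXT_TOKENS = ["data", "customers", "designers", "code", "images"]
PROBABILITIES = [0.40, 0.25, 0.15, 0.12, 0.08]


def sample_tokens(seed, k=5):
    """Draw k tokens from the toy distribution using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(NEXT_TOKENS, weights=PROBABILITIES, k=k)


first = sample_tokens(seed=42)
second = sample_tokens(seed=42)
print(first)
print(first == second)  # same seed, same draws
```

The `set_seed` helper from `transformers`, used later in this notebook, plays the same role for the real generator.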

In [20]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Downloading: 100%|██████████| 665/665 [00:00<00:00, 267kB/s]
Downloading: 100%|██████████| 475M/475M [00:46<00:00, 10.7MB/s]
All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Downloading: 100%|██████████| 0.99M/0.99M [00:01<00:00, 1.01MB/s]
Downloading: 100%|██████████| 446k/446k [00:00<00:00, 679kB/s]
Downloading: 100%|██████████| 1.29M/1.29M [00:02<00:00, 465kB/s]


In [21]:
generator("As a Data Scientist at Adobe I mostly work with")

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'As a Data Scientist at Adobe I mostly work with data security and privacy issues, such as security on computers.\n\nI do not currently see the need to hire a specialist software Engineer to support Adobe. But a good fit is my desire to provide'}]

In [23]:
generator("As a Data Engineer at Adobe I", num_return_sequences=5, max_length=100)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': "As a Data Engineer at Adobe I've seen many interesting things that happened, but they were completely unrelated to what I was doing for a particular job at Adobe. In fact, as a Data Engineer, it's always better to be more confident and to know more about what your customers need and are looking for.\n\nWe're going forward with this challenge with increasing visibility through our new feature, the 'Discovery Driven Design' and the role you should have in taking all of this information with"},
 {'generated_text': "As a Data Engineer at Adobe I've developed a couple of very basic tools designed to help people write fast and responsive code. I'm working on this post now since I've been able to write my own documentation. I want to thank you all for your questions on this.\n\nFirst step\n\nYou'll work on some code that creates a JSON file that you use to represent the data in your application. So, you'll open up various apps such as Office 365, Outlook and Calendar for"

## Using a specific model from the Hugging Face Hub in a pipeline: GPT-2

Go to the Model Hub and click on the corresponding tag on the left to display only the models supported for that task.

Let’s load a specific model by name. The cell below loads gpt2 in the same pipeline as before; the commented-out line shows how to load distilgpt2 instead:

In [26]:
# from transformers import pipeline, set_seed
# generator = pipeline('text-generation', model='distilgpt2')
generator = pipeline('text-generation', model='gpt2')
# set_seed(42)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [27]:
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'Hello, I\'m a language model, but I\'m not sure what the "classification criteria" are. My classifying is actually the standard definition'},
 {'generated_text': "Hello, I'm a language model, so I don't go through the complicated things in programming to create a nice language. So a lot of things"},
 {'generated_text': "Hello, I'm a language model, the language is not based on anything you have seen before. In terms of how you make those code, people"},
 {'generated_text': 'Hello, I\'m a language model, I\'m a person," he says.\n\nOn Thursday night, at noon on the 11th of September'},
 {'generated_text': "Hello, I'm a language model, I'm not saying if I'll learn to code well, but if I do, then I'm done."}]

### Limitations and bias

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

_Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true._

_Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes._

In [30]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("The White man worked as a", max_length=10, num_return_sequences=5)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'The White man worked as a sales assistant at another'},
 {'generated_text': 'The White man worked as a security guard in the'},
 {'generated_text': 'The White man worked as a bartender at a bank'},
 {'generated_text': 'The White man worked as a car salesman in Richmond'},
 {'generated_text': 'The White man worked as a carpenter by day'}]

In [29]:
set_seed(42)
generator("The Black man worked as a", max_length=10, num_return_sequences=5)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


[{'generated_text': 'The Black man worked as a private investigator after graduating'},
 {'generated_text': 'The Black man worked as a security guard; the'},
 {'generated_text': 'The Black man worked as a bartender at a bank'},
 {'generated_text': 'The Black man worked as a car salesman in Richmond'},
 {'generated_text': 'The Black man worked as a carpenter or metal'}]
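
To compare the two sets of generations more easily, one option is to strip the prompt and line up the continuations side by side. The sketch below uses the outputs copied from the two cells above; the helper name `continuations` is only illustrative. Five samples per prompt are far too few to draw conclusions, so treat this as a first look rather than the bias study the model card calls for.

```python
def continuations(prompt, outputs):
    """Strip the prompt from each generated text, keeping only the continuation."""
    return [o["generated_text"][len(prompt):].strip() for o in outputs]


white_prompt = "The White man worked as a"
black_prompt = "The Black man worked as a"

# Outputs copied from the two cells above.
white_outputs = [
    {"generated_text": "The White man worked as a sales assistant at another"},
    {"generated_text": "The White man worked as a security guard in the"},
    {"generated_text": "The White man worked as a bartender at a bank"},
    {"generated_text": "The White man worked as a car salesman in Richmond"},
    {"generated_text": "The White man worked as a carpenter by day"},
]
black_outputs = [
    {"generated_text": "The Black man worked as a private investigator after graduating"},
    {"generated_text": "The Black man worked as a security guard; the"},
    {"generated_text": "The Black man worked as a bartender at a bank"},
    {"generated_text": "The Black man worked as a car salesman in Richmond"},
    {"generated_text": "The Black man worked as a carpenter or metal"},
]

white = continuations(white_prompt, white_outputs)
black = continuations(black_prompt, black_outputs)

# Print the paired continuations and flag the ones that differ.
for w, b in zip(white, black):
    marker = "same" if w == b else "DIFFERENT"
    print(f"{marker:9} | {w:30} | {b}")
```

With the same seed, several continuations are identical or nearly so, which shows how much the seed, and not only the prompt, shapes a small sample like this.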

In [31]:
# Testing another language
pipe = pipeline('text-generation', model="dbmdz/german-gpt2",
                 tokenizer="dbmdz/german-gpt2")

text = pipe("Der Sinn des Lebens ist es", max_length=100)[0]["generated_text"]
print(text)

Downloading: 100%|██████████| 865/865 [00:00<00:00, 388kB/s]
Downloading: 100%|██████████| 475M/475M [01:11<00:00, 6.95MB/s]
All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at dbmdz/german-gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Downloading: 100%|██████████| 1.37M/1.37M [00:02<00:00, 709kB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Der Sinn des Lebens ist es ja, das Leben und diese Erfahrung miteinander zu verbinden oder beides.
Eine neue Dimension wird in diesem Prozess sichtbar, der "Neue Weg".
Das alte Paradigma ist das, das wir das "Wissen, was ein Gefühl von Liebe" nennen.
Wenn jemand ein Liebesleben oder ein anderes Lebensgefühl liebt, dann ist es das, das man hat.
Alles, was er fühlt und fühlt, ist das, was er fühlt.
Es gibt in diesem Augenblick


## Mask filling

In [32]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)
Downloading: 100%|██████████| 480/480 [00:00<00:00, 217kB/s]
Downloading: 100%|██████████| 465M/465M [01:16<00:00, 6.34MB/s]
All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at distilroberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.
Downloading: 100%|██████████| 878k/878k [00:01<00:00, 689kB/s]
Downloading: 100%|██████████| 446k/446k [00:00<00:00, 700kB/s]
Downloading: 100%|██████████| 1.29M/1.29M [00:01<00:00, 714kB/s]


[{'sequence': 'This course will teach you all about mathematical models.',
  'score': 0.19619612395763397,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about computational models.',
  'score': 0.040526993572711945,
  'token': 38163,
  'token_str': ' computational'}]

The top_k argument controls how many candidates are returned. Note that here the model fills in the special <mask> token, which is often referred to as the mask token. Other mask-filling models might use a different mask token, so it’s always good to verify the proper one when exploring other models. One way to check is to look at the mask token used in the inference widget on the model’s page on the Hub.
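
One way to avoid hard-coding the mask token is to keep the prompt as a template and fill in the token per model. The sketch below is a small illustration with a hypothetical helper, `build_masked_prompt`. The `"[MASK]"` value is the convention of BERT-style checkpoints and is included here as an assumption to verify for any given model. On a loaded pipeline, the tokenizer's `mask_token` attribute (for example `unmasker.tokenizer.mask_token`) should give the right string.

```python
def build_masked_prompt(template, mask_token):
    """Fill the {mask} placeholder with the model-specific mask token."""
    return template.format(mask=mask_token)


template = "This course will teach you all about {mask} models."

# "<mask>" is the token used in the example above; "[MASK]" is the
# convention of BERT-style checkpoints. Verify against the actual tokenizer.
roberta_style = build_masked_prompt(template, "<mask>")
bert_style = build_masked_prompt(template, "[MASK]")
print(roberta_style)
print(bert_style)
```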