# Transformers, what can they do?

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%pip install datasets evaluate transformers[sentencepiece]

Defaulting to user installation because normal site-packages is not writeable
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting protobuf (from transformers[sentencepiece])
  Downloading protobuf-5.28.3-cp310-abi3-win_amd64.whl.metadata (592 bytes)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Downloading protobuf-5.28.3-cp310-abi3-win_amd64.whl (431 kB)

In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]
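The warning above recommends naming the model and revision explicitly. A minimal sketch of doing that, assuming the checkpoint name and short revision hash printed in the warning are the ones you want to pin; `build_classifier` is a hypothetical helper name, not part of the library.

```python
# Checkpoint and revision copied from the warning printed by the cell above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so the constants can be inspected without loading the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)


# Usage (downloads the checkpoint on first call if it is not cached):
# classifier = build_classifier()
# classifier("I've been waiting for a HuggingFace course my whole life.")
```

Pinning both values should keep results stable even if the task's default checkpoint changes in a later release.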

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I love this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'POSITIVE', 'score': 0.9998788833618164}]

In [7]:
classifier(
    ["This a good reasen to do Maschien learning", "This is a bad reason to do Machine Learning"]
)

[{'label': 'POSITIVE', 'score': 0.9996041655540466},
 {'label': 'NEGATIVE', 'score': 0.9998123049736023}]

In [13]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about learing hot to climb mountains.",
    candidate_labels=["education", "politics", "business","sports","entertainment","technology","science","health"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about learing hot to climb mountains.',
 'labels': ['education',
  'sports',
  'health',
  'business',
  'science',
  'entertainment',
  'technology',
  'politics'],
 'scores': [0.38050493597984314,
  0.25120997428894043,
  0.09396220743656158,
  0.07361197471618652,
  0.06646586209535599,
  0.057182252407073975,
  0.05106557533144951,
  0.025997214019298553]}
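The result above is a dictionary whose `labels` and `scores` lists are parallel and sorted by descending score. A minimal sketch of reading it, using values copied from the output; the 0.2 cut-off is an arbitrary illustrative threshold, not something the pipeline applies.

```python
# Labels and scores copied from the zero-shot output above.
result = {
    "labels": ["education", "sports", "health", "business",
               "science", "entertainment", "technology", "politics"],
    "scores": [0.38050493597984314, 0.25120997428894043, 0.09396220743656158,
               0.07361197471618652, 0.06646586209535599, 0.057182252407073975,
               0.05106557533144951, 0.025997214019298553],
}

# Pair each label with its score; the lists are already ordered best-first.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# Keep only labels above an illustrative threshold (assumption: 0.2).
THRESHOLD = 0.2
confident = [label for label, score in ranked if score >= THRESHOLD]

# In this output the scores add up to roughly 1.
total = sum(result["scores"])

print(top_label, round(top_score, 3))
print(confident)
```

Because the scores here add up to about 1, they compete with each other: adding more candidate labels tends to lower every individual score, so a fixed threshold may need adjusting when the label set changes.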

In [14]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build an SQL database using Node.js and React. Your first step will be to build an R database using R. This is in an example of the web app that you will try to write when'}]

In [20]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "This is ",
    max_length=40,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'This is Â\u2005´ Â-\u2005Â-* I can see it’s just another random number.\nThere is no way to check the text of that text,'},
 {'generated_text': 'This is _______. Not only will it create the new page, but it will show all the info about the changes that have been happening in the thread. But it can also create something interesting to'}]

In [25]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=5)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19198362529277802,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.042091839015483856,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'},
 {'score': 0.036024145781993866,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'This course will teach you all about predictive models.'},
 {'score': 0.029781164601445198,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models.'},
 {'score': 0.02392693981528282,
  'token': 3034,
  'token_str': ' computer',
  'sequence': 'This course will teach you all about computer models.'}]

In [29]:
from transformers import pipeline

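# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
# Only the last expression in a cell is displayed, so the output below
# belongs to the second call.
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

ner("My name is Joel and I work at New Voice in Zürich.")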

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER',
  'score': 0.9990613,
  'word': 'Joel',
  'start': 11,
  'end': 15},
 {'entity_group': 'ORG',
  'score': 0.9974774,
  'word': 'New Voice',
  'start': 30,
  'end': 39},
 {'entity_group': 'LOC',
  'score': 0.9942023,
  'word': 'Zürich',
  'start': 43,
  'end': 49}]
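The `start` and `end` values in the output above are character offsets into the input string. A minimal sketch that checks this and groups the entities by type, using values copied from the output (scores omitted for brevity); `by_type` is just an illustrative variable name.

```python
text = "My name is Joel and I work at New Voice in Zürich."

# Entities copied from the NER output above.
entities = [
    {"entity_group": "PER", "word": "Joel", "start": 11, "end": 15},
    {"entity_group": "ORG", "word": "New Voice", "start": 30, "end": 39},
    {"entity_group": "LOC", "word": "Zürich", "start": 43, "end": 49},
]

by_type = {}
for ent in entities:
    # Slicing the input with the offsets recovers the entity text.
    span = text[ent["start"]:ent["end"]]
    assert span == ent["word"]
    by_type.setdefault(ent["entity_group"], []).append(span)

print(by_type)
```

Slicing the original text is usually safer than relying on `word`, since for some tokenizers the reconstructed `word` can differ from the input in casing or spacing.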

In [32]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949759125709534, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [33]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

In [36]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    What does this mean?

    Let's look at the opening statement in more detail:
    Contents

    1 What does this mean?
    2 How does Wikidata work?
        2.1 The Wikidata repository
        2.2 Working with Wikidata
    3 Where to get started
    4 How can I contribute?
    5 See also

    Free. The data in Wikidata is published under the Creative Commons Public Domain Dedication 1.0, allowing the reuse of the data in many different scenarios. You can copy, modify, distribute and perform the data, even for commercial purposes, without asking for permission.
    Collaborative. Data is entered and maintained by Wikidata editors, who decide on the rules of content creation and management. Automated bots also enter data into Wikidata.
    Multilingual. Editing, consuming, browsing, and reusing the data is fully multilingual. Data entered in any language is immediately available in all other languages. Editing in any language is possible and encouraged.
    A secondary knowledge base. Wikidata records not just statements, but also their sources, and connections to other databases. This reflects the diversity of knowledge available and supports the notion of verifiability.
    Collecting structured data. Imposing a high degree of structured organization allows for easy reuse of data by Wikimedia projects and third parties, and enables computers to process and “understand” it.
    Support for Wikimedia wikis. Wikidata assists Wikipedia with more easily maintainable information boxes and links to other languages, thus reducing editing workload while improving quality. Updates in one language are made available to all other languages.
    Anyone in the world. Anyone can use data from Wikidata to build their applications and services.

    How does Wikidata work?
    This diagram of a Wikidata item shows you the most important terms in Wikidata.

    Wikidata is a central storage repository that can be accessed by others, such as the wikis maintained by the Wikimedia Foundation. Content loaded dynamically from Wikidata does not need to be maintained in each individual wiki project. For example, statistics, dates, locations, and other common data can be centralized in Wikidata.
    The Wikidata repository
    Items and their data are interconnected.

    The Wikidata repository consists mainly of items, each one having a label, a description and any number of aliases. Items are uniquely identified by a Q followed by a number, such as Douglas Adams (Q42).

    Statements describe detailed characteristics of an Item and consist of a property and a value. Properties in Wikidata have a P followed by a number, such as with educated at (P69).

    For a person, you can add a property to specify where they were educated, by specifying a value for a school. For buildings, you can assign geographic coordinates properties by specifying longitude and latitude values. Properties can also link to external databases. A property that links an item to an external database, such as an authority control database used by libraries and archives, is called an identifier. Special Sitelinks connect an item to corresponding content on client wikis, such as Wikipedia, Wikibooks or Wikiquote.

    All this information can be displayed in any language, even if the data originated in a different language. When accessing these values, client wikis will show the most up-to-date data.
    Item 	Property 	Value
    Q42 	P69 	Q691283
    Douglas Adams 	educated at 	St John's College
    Working with Wikidata

    There are a number of ways to access Wikidata using built-in tools, external tools, or programming interfaces.

    Wikidata Query and Reasonator are some of the popular tools to search for and examine Wikidata items. The tools page has an extensive list of interesting projects to explore.
    You can retrieve all data programmatically using different APIs and service.
    Client wikis can access data for their pages using a Lua Scribunto interface.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The data in Wikidata is published under the Creative Commons Public Domain Dedication 1.0, allowing the reuse of the data in many different scenarios . Editing, consuming, browsing, and reusing the data is fully multilingual. Editing in any language is possible and encouraged .'}]

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

In [35]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
translator("Programieren ist für mich wie zauberei.")

[{'translation_text': 'For me, programming is like magic.'}]