<a href="https://colab.research.google.com/github/PaulVH53/hugging_face_codes/blob/main/hugging_face_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from transformers import pipeline

The sentiment-analysis pipeline classifies texts as positive or negative.

In [3]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
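
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the checkpoint and revision reported in the log. The helper name `build_sentiment_classifier` is hypothetical, and the import sits inside the function only so that defining it does not trigger a model download.

```python
# Checkpoint name and revision copied from the log message above.
PINNED_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
PINNED_REVISION = "af0f99b"


def build_sentiment_classifier():
    """Hypothetical helper: build the pipeline with an explicit model and revision."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=PINNED_MODEL,
        revision=PINNED_REVISION,
    )
```

Calling `build_sentiment_classifier()` should behave like `pipeline("sentiment-analysis")` in this notebook, but without depending on whatever default the installed library version picks.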


In [4]:
output = classifier("I've been waiting for a HuggingFace course my whole life.")
print(output)
# [{'label': 'POSITIVE', 'score': 0.9598048329353333}]

[{'label': 'POSITIVE', 'score': 0.9598048329353333}]


In [5]:
output = classifier("I've forgotten my cousin's birthday.")
print(output)
# [{'label': 'NEGATIVE', 'score': 0.9994685053825378}]

[{'label': 'NEGATIVE', 'score': 0.9994685053825378}]


In [6]:
output = classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
])
print(output)
# [{'label': 'POSITIVE', 'score': 0.9598048329353333}, {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

[{'label': 'POSITIVE', 'score': 0.9598048329353333}, {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
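
When a list is passed, the pipeline returns one dictionary per input, in the same order. A small sketch of pairing inputs with predictions and keeping only the confident ones. The scores are copied from the output above rather than recomputed, and both the helper name `confident_predictions` and the 0.99 threshold are illustrative assumptions.

```python
texts = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Predictions copied from the output above, not recomputed here.
sample_output = [
    {"label": "POSITIVE", "score": 0.9598048329353333},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]


def confident_predictions(texts, predictions, threshold):
    """Return (text, label) pairs whose score is at least `threshold`."""
    return [
        (text, pred["label"])
        for text, pred in zip(texts, predictions)
        if pred["score"] >= threshold
    ]


kept = confident_predictions(texts, sample_output, threshold=0.99)
print(kept)
# [('I hate this so much!', 'NEGATIVE')]
```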


The zero-shot-classification pipeline lets you choose the candidate labels for classification when you call it.

In [8]:
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [10]:
output = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(output)
# {'sequence': 'This is a course about the Transformers library',
#  'labels': ['education', 'business', 'politics'],
#  'scores': [0.8445991277694702, 0.11197404563426971, 0.043426841497421265]}

{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445991277694702, 0.11197404563426971, 0.043426841497421265]}
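
The result lists the labels in order of decreasing score, so the first label is the prediction. A short sketch of reading that structure, using the values printed above. The variable names are illustrative, and the sum check assumes the default single-label mode, where the scores are normalized across the candidate labels.

```python
# Result copied from the output above, not recomputed here.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445991277694702, 0.11197404563426971, 0.043426841497421265],
}

# Labels and scores are parallel lists; zip them into a lookup table.
score_by_label = dict(zip(result["labels"], result["scores"]))

# The highest-scoring label comes first.
top_label = result["labels"][0]

# In the default single-label mode the scores should sum to roughly 1.
total = sum(result["scores"])

print(top_label, round(score_by_label[top_label], 3))
```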


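The text-generation pipeline generates a continuation of an input prompt. Because generation involves sampling, repeated calls with the same prompt can return different text, as the two cells below show.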

In [12]:
generator = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [13]:
generator("In this course, we will teach you how to")
# [{'generated_text': 'In this course, we will teach you how to use your hands to create a pattern with each step.
#   For each part of the pattern, we will take a simple line and make it bigger.
#   The smaller the line the more "bigger" there'}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use your hands to create a pattern with each step. For each part of the pattern, we will take a simple line and make it bigger. The smaller the line the more "bigger" there'}]

In [14]:
generator("In this course, we will teach you how to")
# [{'generated_text': 'In this course, we will teach you how to build a virtual reality headset by watching our VR demo,
#  and demonstrate how you can build a game using a set of 3D modelling and animating components.'}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build a virtual reality headset by watching our VR demo, and demonstrate how you can build a game using a set of 3D modelling and animating components.'}]

Zero-shot SELECTRA is a zero-shot classifier for Spanish text based on SELECTRA. Passing `model=` to `pipeline` selects it instead of the default checkpoint.

In [3]:
classifier = pipeline("zero-shot-classification",
                       model="Recognai/zeroshot_selectra_medium")



In [5]:
output = classifier(
    "El autor se perfila, a los 50 años de su muerte, como uno de los grandes de su siglo",
    candidate_labels=["cultura", "sociedad", "economia", "salud", "deportes"],
    hypothesis_template="Este ejemplo es {}."
)

print(output)
# {'sequence': 'El autor se perfila, a los 50 años de su muerte, como uno de los grandes de su siglo',
#  'labels': ['sociedad', 'cultura', 'economia', 'salud', 'deportes'],
#  'scores': [0.6450009942054749, 0.16710814833641052, 0.08507563173770905, 0.0759846568107605, 0.026830561459064484]}

{'sequence': 'El autor se perfila, a los 50 años de su muerte, como uno de los grandes de su siglo', 'labels': ['sociedad', 'cultura', 'economia', 'salud', 'deportes'], 'scores': [0.6450009942054749, 0.16710814833641052, 0.08507563173770905, 0.0759846568107605, 0.026830561459064484]}


In [6]:
generator = pipeline("text-generation", model="distilgpt2")

In [7]:
output = generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
print(output)
# [{'generated_text': 'In this course, we will teach you how to apply C-A-C or C-B-A-C-F-A-C'},
#  {'generated_text': 'In this course, we will teach you how to read the English alphabet with a few questions.\n\n\n
#    For the purpose of this course, we'}]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to apply C-A-C or C-B-A-C-F-A-C'}, {'generated_text': 'In this course, we will teach you how to read the English alphabet with a few questions.\n\n\nFor the purpose of this course, we'}]


In [8]:
output = generator(
    "In this course, we will teach you how to",
    max_length=50,
    num_return_sequences=3,
)
print(output)
# [{'generated_text': 'In this course, we will teach you how to convert the Java language into JavaScript
#    using C# (like in this program):\n\n\n\n\n\n(Note: If you use JavaScript, you will be redirected to
#    the HTML5 module to see'},
#    {'generated_text': "In this course, we will teach you how to successfully get around the technical problem
#    of distributed distribution, and how you can get involved if needed. Each course has unique goals, so make
#    sure you're on the right track. But here's the main"},
#    {'generated_text': 'In this course, we will teach you how to convert, visualize, and even build a
#    website,\u202a️”\n\n\n\nUsing the following tools for navigating the web:\n\u202a️ Inbound.js\n\u202a️'}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to convert the Java language into JavaScript using C# (like in this program):\n\n\n\n\n\n(Note: If you use JavaScript, you will be redirected to the HTML5 module to see'}, {'generated_text': "In this course, we will teach you how to successfully get around the technical problem of distributed distribution, and how you can get involved if needed. Each course has unique goals, so make sure you're on the right track. But here's the main"}, {'generated_text': 'In this course, we will teach you how to convert, visualize, and even build a website,\u202a️”\n\n\n\nUsing the following tools for navigating the web:\n\u202a️ Inbound.js\n\u202a️'}]


In [9]:
unmasker = pipeline("fill-mask")
output = unmasker("This course will teach you all about <mask> models", top_k=2)
print(output)
# [{'score': 0.1963157057762146, 'token': 30412, 'token_str': ' mathematical',
#   'sequence': 'This course will teach you all about mathematical models'},
#    {'score': 0.044492196291685104, 'token': 745, 'token_str': ' building',
#    'sequence': 'This course will teach you all about building models'}]

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.1963157057762146, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models'}, {'score': 0.044492196291685104, 'token': 745, 'token_str': ' building', 'sequence': 'This course will teach you all about building models'}]


The NER pipeline identifies entities such as persons, organizations or locations in a sentence.

In [11]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
output = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
print(output)
# [{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
#  {'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45},
#  {'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18}, {'entity_group': 'ORG', 'score': 0.9796019, 'word': 'Hugging Face', 'start': 33, 'end': 45}, {'entity_group': 'LOC', 'score': 0.9932106, 'word': 'Brooklyn', 'start': 49, 'end': 57}]
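
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A brief sketch using the entities printed above; the variable names are illustrative.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above, not recomputed here.
entities = [
    {"entity_group": "PER", "score": 0.9981694, "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "score": 0.9796019, "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "score": 0.9932106, "word": "Brooklyn", "start": 49, "end": 57},
]

# Slice the original text with each entity's offsets.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]
print(spans)
# [('PER', 'Sylvain'), ('ORG', 'Hugging Face'), ('LOC', 'Brooklyn')]
```

Slicing the original text is useful when the `word` field differs from the source spelling, which can happen with tokenizers that lowercase or split words.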


The question-answering pipeline extracts answers to a question from a given context.

In [12]:
question_answerer = pipeline("question-answering")
output = question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn.",
)
print(output)
# {'score': 0.6385912299156189, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6385912299156189, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
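
The answer is extracted from the context rather than generated, so `start` and `end` locate it in the context string. A small sketch that marks the answer in place, using the result printed above; the bracket markers and variable names are illustrative choices.

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Result copied from the output above, not recomputed here.
answer = {"score": 0.6385912299156189, "start": 33, "end": 45, "answer": "Hugging Face"}

start, end = answer["start"], answer["end"]

# Wrap the extracted span in brackets to show where it sits in the context.
highlighted = context[:start] + "[" + context[start:end] + "]" + context[end:]
print(highlighted)
# My name is Sylvain and I work at [Hugging Face] in Brooklyn.
```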


The summarization pipeline creates summaries of long texts.

In [13]:
summarizer = pipeline("summarization")
output = summarizer("""
    The Industrial Revolution was a period of major economic, technological, and social change that began in the late 18th century and lasted through the early 19th century. It marked a transition from agrarian economies to industrialized ones, characterized by the widespread adoption of industrial methods of production.

During this time, there were significant advancements in machinery, transportation, and communication, leading to increased productivity and economic growth. Inventions such as the steam engine, textile machinery, and the telegraph revolutionized industries and transformed the way people lived and worked.

The Industrial Revolution had profound effects on society, including urbanization, the rise of the factory system, and the emergence of a new middle class. It also brought about social and economic inequalities, as well as environmental challenges due to pollution and resource depletion.

Overall, the Industrial Revolution was a period of rapid change and innovation that laid the foundation for modern industrial societies.

""")
print(output)
# [{'summary_text': ' The Industrial Revolution was a period of major economic, technological, and social change that
#   began in the late 18th century . It marked a transition from agrarian economies to industrialized ones . Inventions
#   such as the steam engine, textile machinery, and the telegraph revolutionized industries and transformed the way
#   people lived and worked .'}]

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The Industrial Revolution was a period of major economic, technological, and social change that began in the late 18th century . It marked a transition from agrarian economies to industrialized ones . Inventions such as the steam engine, textile machinery, and the telegraph revolutionized industries and transformed the way people lived and worked .'}]


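The translation pipeline translates text from one language to another. The language pair is determined by the model; the example below uses a French-to-English checkpoint.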

In [15]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
output = translator("Ce cours est produit par Hugging Face.")
print(output)
# [{'translation_text': 'This course is produced by Hugging Face.'}]

[{'translation_text': 'This course is produced by Hugging Face.'}]
