# Transformers - Text Models Pipelines



Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]



In [2]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("This is so awesome")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998668432235718}]
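
The warning in the output above recommends naming the model and revision explicitly. Below is a minimal sketch of one way to do that, reusing the model ID and revision reported in that log. The helper name `build_sentiment_classifier` is hypothetical, and the import sits inside the function so that nothing is downloaded until the helper is called.

```python
# Model ID and revision copied from the warning printed by the default pipeline.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_sentiment_classifier():
    """Return a sentiment pipeline pinned to an explicit model and revision."""
    # Imported lazily so that defining this helper does not trigger a download.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `build_sentiment_classifier()` should then behave like the default `pipeline("sentiment-analysis")` in this notebook, without depending on whatever default the library ships in a later release.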

In [3]:
classifier(
    ["I enjoy what am learning so far.", "I hate this regime so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9998506307601929},
 {'label': 'NEGATIVE', 'score': 0.9993481040000916}]

In [4]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445994257926941, 0.11197380721569061, 0.04342673346400261]}

In [8]:
classifier(
    "More young people should come out and vie for elective in 2027",
    candidate_labels=["religion", "politics", "arts"],
)

{'sequence': 'More young people should come out and vie for elective in 2027',
 'labels': ['politics', 'arts', 'religion'],
 'scores': [0.9610762000083923, 0.03124106675386429, 0.007682805880904198]}
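
The zero-shot result is a plain dictionary, so it can be post-processed without the model. The sketch below reuses the output shown above to pair each label with its score and to check that, in the default single-label mode, the scores add up to roughly 1. The variable names are illustrative only.

```python
# Output copied from the zero-shot cell above.
result = {
    "sequence": "More young people should come out and vie for elective in 2027",
    "labels": ["politics", "arts", "religion"],
    "scores": [0.9610762000083923, 0.03124106675386429, 0.007682805880904198],
}

# Pair each label with its score; the pipeline returns them sorted by descending score.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]

# In the default single-label mode the scores are normalized across the candidate labels.
total = sum(result["scores"])

print(f"top label: {top_label} ({top_score:.3f})")
print(f"sum of scores: {total:.6f}")
```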

In [9]:
from transformers import pipeline

generator = pipeline('text-generation')
generator('My name is Brian and am studying NLP hoping to')

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "My name is Brian and am studying NLP hoping to become a professional audio engineer.\n\nI am looking for a job that will allow me to take on the task of recording and editing an audio file for all audio players. I have a BA in audio engineering and want to be able to use the tools I have to help others.\n\nI am a strong advocate for open source projects and I believe the way to truly have a great audio experience is to create something that works for you.\n\nI am looking for a professional audio engineer to work in a variety of roles including music production, editing video, recording and editing music.\n\nI'm an avid listener and hope to make it in the next five years. I have a passion for music and will always be looking for new ways to make my music more accessible and accessible to the public.\n\nI am happy to let you know that I am a musician and am looking for a job that will allow me to share my passion with you.\n\nI am also looking for a job that will all

In [10]:
# Text generation samples tokens at random, so identical prompts can give different results.
# We can also control the number of returned sequences and the maximum length.
generator(
    'My name is Brian and am studying NLP hoping to',
    num_return_sequences=2,
    max_length=100,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'My name is Brian and am studying NLP hoping to do some research on the future of this genre. I also love the way this genre has been created. There seems to be a growing interest in the field of alternative music. I think there is a huge opportunity for this type of music to influence people to create their own music and to develop their own community.\n\nHow did you come up with the term "Alternative Music"? Is there anything that makes you think of alternative music as a genre that isn\'t really so much a genre as a lifestyle choice?\n\nI think that the term "Alternative Music" has it\'s roots in the days of the Beatles and John Lennon. The Beatles and John Lennon were instrumental in starting this genre in the U.S. The Beatles and John Lennon were instrumental in the formation of this genre throughout history. I think there is a huge opportunity for this type of music to influence people to create their own music and to develop their own community.\n\nHow did yo
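
Two runs of the same prompt differ because each next token is drawn at random from a probability distribution. The toy sketch below illustrates that idea with the standard library only. The token probabilities are hypothetical and are not taken from GPT-2, and `sample_tokens` is an illustrative helper, not part of Transformers.

```python
import random

# Hypothetical next-token probabilities, for illustration only (not from GPT-2).
NEXT_TOKEN_PROBS = {"become": 0.4, "do": 0.3, "learn": 0.2, "build": 0.1}


def sample_tokens(n, seed=None):
    """Draw n tokens from the toy distribution; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=n)


print(sample_tokens(5))           # unseeded: may differ from run to run
print(sample_tokens(5, seed=42))  # seeded: the same draw on every run
```

For the real pipeline, Transformers provides `set_seed` for the same purpose. Calling it before generating should make the sampled text repeatable on a given setup.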

In [11]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "I enjoy studying Natural Language Processing",
    max_length=30,
    num_return_sequences=2,
)


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=30) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'I enjoy studying Natural Language Processing (NLP). I am interested in learning about my favorite language, and I want to share it with you. I recently published my book, Language In the World: How the World Became the New Language, by Alex G. G. Blomberg (Princeton University Press, 2009).\n\n\n\nI’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’’'},
 {'generated_text': 'I enjoy studying Natural Language Processing and Language Development.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
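
The generated text above is much longer than 30 tokens because, as the warning says, `max_new_tokens` (256 by default here) takes precedence over `max_length`. The sketch below spells out that rule under two assumptions: `max_length` counts the prompt plus the generated tokens, while `max_new_tokens` counts only the generated ones. The helper `new_token_budget` and the prompt length of 6 tokens are hypothetical.

```python
def new_token_budget(prompt_tokens, max_length=None, max_new_tokens=None):
    """Return how many tokens may be generated under the precedence rule in the warning."""
    if max_new_tokens is not None:
        # max_new_tokens wins whenever both limits are set.
        return max_new_tokens
    if max_length is not None:
        # max_length covers the prompt as well, so subtract the prompt length.
        return max(max_length - prompt_tokens, 0)
    return None


# Both limits set, as in the cell above: the 256-token default wins.
print(new_token_budget(6, max_length=30, max_new_tokens=256))
# Only max_length set: the prompt eats into the budget.
print(new_token_budget(6, max_length=30))
```

Passing `max_new_tokens=30` to the generator instead of `max_length=30` is likely the more direct way to cap the output length.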

In [12]:
from transformers import pipeline

unmasker = pipeline('fill-mask')
unmasker(
    'In the Hugging Face Natural <mask> Processing course, am learning about tokenization and positional encoding',
    top_k=3,
)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


[{'score': 0.504921019077301,
  'token': 22205,
  'token_str': ' Language',
  'sequence': 'In the Hugging Face Natural Language Processing course, am learning about tokenization and positional encoding'},
 {'score': 0.16665366291999817,
  'token': 27695,
  'token_str': ' Signal',
  'sequence': 'In the Hugging Face Natural Signal Processing course, am learning about tokenization and positional encoding'},
 {'score': 0.06825456023216248,
  'token': 2960,
  'token_str': ' Image',
  'sequence': 'In the Hugging Face Natural Image Processing course, am learning about tokenization and positional encoding'}]
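
The `<mask>` placeholder used above is specific to RoBERTa-style checkpoints such as `distilroberta-base`; BERT-style checkpoints expect `[MASK]` instead. One way to keep prompts portable is to write them with a neutral placeholder and substitute the model's own token. This is a minimal sketch, and the helper `with_mask` and the `{mask}` placeholder are hypothetical conventions.

```python
def with_mask(template, mask_token):
    """Fill the '{mask}' placeholder in template with a model-specific mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


template = "I am taking a Natural {mask} Processing course."

print(with_mask(template, "<mask>"))  # RoBERTa-style mask token
print(with_mask(template, "[MASK]"))  # BERT-style mask token
```

With a loaded pipeline, `unmasker.tokenizer.mask_token` gives the token the current model expects, so it can be passed as `mask_token` here.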

In [13]:
from transformers import pipeline

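# aggregation_strategy="simple" merges sub-word tokens into whole entities;
# it replaces the deprecated grouped_entities=True flag.
ner = pipeline('ner', aggregation_strategy='simple')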
ner('My name is Okoth and I studied at Egerton University in Nakuru')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.99699175),
  'word': 'Okoth',
  'start': 11,
  'end': 16},
 {'entity_group': 'ORG',
  'score': np.float32(0.98712695),
  'word': 'Egerton University',
  'start': 34,
  'end': 52},
 {'entity_group': 'LOC',
  'score': np.float32(0.9868085),
  'word': 'Nakuru',
  'start': 56,
  'end': 62}]
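
The `start` and `end` fields are character offsets into the input string, with `end` exclusive, so each entity can be recovered by slicing. The sketch below checks this against the output above; the scores are left out for brevity.

```python
text = "My name is Okoth and I studied at Egerton University in Nakuru"

# Entities copied from the output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Okoth", "start": 11, "end": 16},
    {"entity_group": "ORG", "word": "Egerton University", "start": 34, "end": 52},
    {"entity_group": "LOC", "word": "Nakuru", "start": 56, "end": 62},
]

for ent in entities:
    # Slice the original text with the reported offsets.
    span = text[ent["start"]:ent["end"]]
    print(f'{ent["entity_group"]}: {span}')
```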

In [14]:
from transformers import pipeline

question_answerer = pipeline('question-answering')
question_answerer(
    question='Where did I study?',
    context='My name is Okoth and I studied at Egerton University in Nakuru',
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


{'score': 0.8591462612275791,
 'start': 34,
 'end': 52,
 'answer': 'Egerton University'}
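
Because this pipeline is extractive, the answer is a span of the context, and the `score` says how confident the model is in that span. The sketch below marks the span in the context and discards low-confidence answers. The helper `highlight_answer` and the 0.5 default threshold are hypothetical choices, not library defaults.

```python
context = "My name is Okoth and I studied at Egerton University in Nakuru"

# Result copied from the output above.
result = {"score": 0.8591462612275791, "start": 34, "end": 52, "answer": "Egerton University"}


def highlight_answer(context, result, min_score=0.5):
    """Bracket the answer span in context, or return None if the score is too low."""
    if result["score"] < min_score:
        return None
    start, end = result["start"], result["end"]
    return f"{context[:start]}[{context[start:end]}]{context[end:]}"


print(highlight_answer(context, result))
```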

In [15]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    This is a critical moment for Kenyan youth to embark on self-study in Natural Language Processing (NLP), as the global and local digital landscapes are rapidly evolving. Across the world, industries are shifting toward AI-driven solutions, and NLP sits at the core of this transformation—powering chatbots, translation systems, voice assistants, and information retrieval tools. In Kenya, where the digital economy is expanding and technology adoption is high, the demand for localized AI solutions is growing quickly. NLP offers a unique opportunity for youth to build systems that understand and process Kiswahili, Sheng, and other African languages, filling a gap often overlooked by global tech companies. Moreover, with the rise of remote work and open-source communities, Kenyan youth can contribute to cutting-edge projects, gain visibility, and secure global opportunities without leaving the country. Self-studying NLP also fosters innovation in local sectors like fintech, agriculture, education, and healthcare—where AI tools can solve context-specific challenges. As access to affordable internet, open datasets, and free learning resources continues to improve, the barriers to entry are lower than ever. By starting now, Kenyan youth can position themselves not just as consumers of AI, but as active creators shaping Africa’s digital future.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


[{'summary_text': ' This is a critical moment for Kenyan youth to embark on self-study in Natural Language Processing . NLP offers a unique opportunity for youth to build systems that understand and process Kiswahili, Sheng, and other African languages . Self-studying NLP also fosters innovation in local sectors like fintech, agriculture, education, and healthcare .'}]
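
The summary above has a leading space and a space before each full stop, which is a common artifact of this checkpoint's output. A small regular expression can tidy it up. This is a sketch, and `tidy_summary` is an illustrative helper rather than part of Transformers.

```python
import re

# Summary text copied from the output above.
summary = (
    " This is a critical moment for Kenyan youth to embark on self-study in "
    "Natural Language Processing . NLP offers a unique opportunity for youth to "
    "build systems that understand and process Kiswahili, Sheng, and other African "
    "languages . Self-studying NLP also fosters innovation in local sectors like "
    "fintech, agriculture, education, and healthcare ."
)


def tidy_summary(text):
    """Remove whitespace before punctuation and trim the ends."""
    return re.sub(r"\s+([.,!?;:])", r"\1", text).strip()


print(tidy_summary(summary))
```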

In [16]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("j'apprends actuellement le traitement du langage naturel (NLP).")


Device set to use cpu


[{'translation_text': 'I am currently learning natural language (NLP) treatment.'}]