
# Transformers, what can they do?

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [4]:
# The `pipeline("sentiment-analysis")` line automatically:
# - Loads a pretrained model (usually `distilbert-base-uncased-finetuned-sst-2-english`)
# - Sets up the tokenizer and model for inference
# - Preprocesses the input, runs it through the model, and postprocesses the result
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("No more buses, no more waiting — just me and my Lamborghini in the fast lane of life.")
# Input  -> a string (or a list of strings)
# Output -> a list of dicts, each with a label (POSITIVE/NEGATIVE) and a confidence score


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.5876696705818176}]
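The postprocessing step mentioned in the comments above can be sketched in plain Python. This is only an illustration of the idea (softmax over the model's raw scores, then pick the most probable label); the logits are made up, and the helper names `softmax` and `postprocess` are hypothetical, not part of the Transformers API.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable label and return it in the pipeline's output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence; a real model would produce these.
example_logits = [-0.2, 0.15]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

result = postprocess(example_logits, id2label)
print(result)
```

Two logits that are close together give a score only slightly above 0.5, which is one way to read a low-confidence result like the one above.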

In [5]:
classifier("I’m so happy I could cry.")

[{'label': 'POSITIVE', 'score': 0.9995484948158264}]

In [7]:
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
        "It’s not perfect, but I can live with it.",
        "I’ve had worse days, but this one’s trying hard.",
    ]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455},
 {'label': 'POSITIVE', 'score': 0.9960353970527649},
 {'label': 'NEGATIVE', 'score': 0.8492825031280518}]

In [9]:
# Trying ambiguous statements
classifier("Wow, you really did that.")
# Sarcastic or impressed?

[{'label': 'POSITIVE', 'score': 0.9997573494911194}]
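One simple way to act on the confidence score is to treat low-scoring predictions as undecided. The sketch below assumes a cutoff of 0.9, which is an arbitrary choice for illustration; `label_with_threshold` is a hypothetical helper, and the scores are copied from the outputs above.

```python
def label_with_threshold(prediction, threshold=0.9):
    """Return the predicted label, or "UNCERTAIN" when the score is below the threshold."""
    if prediction["score"] < threshold:
        return "UNCERTAIN"
    return prediction["label"]


# Predictions copied from the cells above.
lamborghini = {"label": "POSITIVE", "score": 0.5876696705818176}
happy_cry = {"label": "POSITIVE", "score": 0.9995484948158264}
wow = {"label": "POSITIVE", "score": 0.9997573494911194}

for name, pred in [("lamborghini", lamborghini), ("happy_cry", happy_cry), ("wow", wow)]:
    print(name, "->", label_with_threshold(pred))
```

The threshold flags the Lamborghini sentence but not "Wow, you really did that.", which scored above 0.999. A high score therefore says how sure the model is, not whether it picked up on tone such as sarcasm.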

In [12]:
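# ➤ zero-shot-classification → classifies text against labels chosen at inference time
# ➤ model = NLI model used as a zero-shot classifier (e.g., bart-large-mnli)
# ➤ input = unlabeled text
# ➤ candidate_labels = custom label set
# ➤ output → ranked labels by relevance + scores

from transformers import pipeline

# Re-create `classifier`: the sentiment-analysis pipeline defined earlier cannot rank custom labels.
classifier = pipeline("zero-shot-classification")
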
classifier(
    "We're experimenting with reinforcement learning algorithms to optimize robotic movement.",
    candidate_labels=["machine learning", "robotics", "healthcare"],
)

{'sequence': "We're experimenting with reinforcement learning algorithms to optimize robotic movement.",
 'labels': ['machine learning', 'robotics', 'healthcare'],
 'scores': [0.6172025799751282, 0.37971267104148865, 0.0030847135931253433]}
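The result is a single dict whose `labels` and `scores` lists are aligned and sorted from most to least relevant. A small sketch of reading it, using the values from the output above:

```python
# Output copied from the cell above.
result = {
    "sequence": "We're experimenting with reinforcement learning algorithms to optimize robotic movement.",
    "labels": ["machine learning", "robotics", "healthcare"],
    "scores": [0.6172025799751282, 0.37971267104148865, 0.0030847135931253433],
}

# Pair each label with its score; the lists are already ranked.
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]
total = sum(result["scores"])

print(f"top label: {top_label} ({top_score:.3f})")
print(f"sum of scores: {total:.6f}")
```

Here the scores add up to roughly 1, so the labels compete with each other. If a text should be allowed to match several labels independently, the zero-shot pipeline's `multi_label=True` argument is worth trying.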

In [14]:
# ➤ text-generation → auto-completes a given prompt
# ➤ default model = GPT-2 (unless specified)
# ➤ output = generated text (can control length, randomness)
# ➤ use for → creative writing, code gen, dialogue, etc.
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to manage and manage your time using the Microsoft Office productivity application, which makes for the perfect example for how to manage and manage your time using the Microsoft Office productivity application.'}]

In [15]:
# ➤ generation parameters:
#     ↳ max_length = 30 → total number of tokens (prompt + generated)
#     ↳ num_return_sequences = 2 → generate 2 different completions
# ➤ other optional params (not used here):
#     ↳ temperature → controls randomness (higher = more varied, lower = more predictable)
#     ↳ top_k / top_p → limit sampling to the k most likely tokens / to the smallest
#       set of tokens whose cumulative probability reaches p
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use a single-touch system so that the user can control the layout without any configuration or limitations so'},
 {'generated_text': "In this course, we will teach you how to do things that aren't your own, but what you can do with them.\u200d\n\n"}]
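To make `temperature` and `top_k` more concrete, here is a minimal sketch of how they reshape a next-token distribution before sampling. The four-token vocabulary and its logits are invented for illustration, and `next_token_distribution` is a hypothetical helper; the library's own implementation is more involved.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Apply temperature scaling, then optional top-k filtering, then softmax."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the k largest logits and mask the rest so they get zero probability.
        # (Ties at the cutoff would keep more than k tokens in this simple version.)
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    return softmax(scaled)


# Made-up vocabulary and logits for the token that follows the prompt.
tokens = ["use", "build", "train", "bake"]
logits = [2.0, 1.5, 1.0, -1.0]

sharp = next_token_distribution(logits, temperature=0.5)  # more peaked
flat = next_token_distribution(logits, temperature=2.0)   # more spread out
top2 = next_token_distribution(logits, top_k=2)           # only two candidates remain

rng = random.Random(0)  # fixed seed so the draw is repeatable
sampled = rng.choices(tokens, weights=top2, k=1)[0]
print(sharp, flat, top2, sampled, sep="\n")
```

Lower temperatures push probability toward the top token, while `top_k` removes unlikely tokens from consideration altogether.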

In [23]:
# ➤ fill-mask → predicts the masked word in a sentence
# ➤ uses masked language models (e.g., BERT)
# ➤ the mask token depends on the model: BERT uses [MASK], some other models use <mask>
# ➤ top_k = number of predictions returned (ranked by confidence)

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
unmasker("This summer, I hope to land a [MASK] internship at a tech company.", top_k=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.6539412140846252,
  'token': 2621,
  'token_str': 'summer',
  'sequence': 'this summer, i hope to land a summer internship at a tech company.'},
 {'score': 0.020114146173000336,
  'token': 21115,
  'token_str': 'lucrative',
  'sequence': 'this summer, i hope to land a lucrative internship at a tech company.'}]

In [24]:
# ➤ ner = Named Entity Recognition → detects entities in text
# ➤ aggregation_strategy="simple" → combines subword tokens into full entities
#     (replaces the older grouped_entities=True, which is deprecated in recent versions)
# ➤ output = list of detected entities with:
#     ↳ entity_group (e.g., PER, ORG, LOC)
#     ↳ word (actual entity text)
#     ↳ score (confidence), start/end (char positions)

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
ner("Barack Obama studied at Harvard University, worked in Washington, and now lives in Chicago with Michelle Obama.")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': np.float32(0.9991107),
  'word': 'Barack Obama',
  'start': 0,
  'end': 12},
 {'entity_group': 'ORG',
  'score': np.float32(0.951369),
  'word': 'Harvard University',
  'start': 24,
  'end': 42},
 {'entity_group': 'LOC',
  'score': np.float32(0.99944824),
  'word': 'Washington',
  'start': 54,
  'end': 64},
 {'entity_group': 'LOC',
  'score': np.float32(0.999298),
  'word': 'Chicago',
  'start': 83,
  'end': 90},
 {'entity_group': 'PER',
  'score': np.float32(0.99817735),
  'word': 'Michelle Obama',
  'start': 96,
  'end': 110}]
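The `start` and `end` values are character offsets into the input string, so each entity can be recovered by slicing. The sketch below uses the offsets from the output above (scores omitted for brevity) and groups the entities by type:

```python
from collections import defaultdict

text = (
    "Barack Obama studied at Harvard University, worked in Washington, "
    "and now lives in Chicago with Michelle Obama."
)

# Entities copied from the output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "ORG", "word": "Harvard University", "start": 24, "end": 42},
    {"entity_group": "LOC", "word": "Washington", "start": 54, "end": 64},
    {"entity_group": "LOC", "word": "Chicago", "start": 83, "end": 90},
    {"entity_group": "PER", "word": "Michelle Obama", "start": 96, "end": 110},
]

by_type = defaultdict(list)
for ent in entities:
    span = text[ent["start"]:ent["end"]]  # slice the original text by character offsets
    by_type[ent["entity_group"]].append(span)

print(dict(by_type))
# {'PER': ['Barack Obama', 'Michelle Obama'], 'ORG': ['Harvard University'], 'LOC': ['Washington', 'Chicago']}
```

Slicing by offsets is handy when the exact original casing or spacing matters, since `word` is reconstructed from tokens.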

In [28]:
# ➤ question-answering → extract answer from given context
# ➤ input = question + context paragraph
# ➤ model scans context and returns:
#     ↳ answer (text span)
#     ↳ score (confidence), start/end (char positions)

from transformers import pipeline

question_answerer = pipeline("question-answering")

question_answerer(
    question="Which university did Alice graduate from?",
    context="Alice completed her studies in computer science at Stanford University in 2022.",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.5476713180541992,
 'start': 51,
 'end': 70,
 'answer': 'Stanford University'}

In [32]:
# ➤ summarization → condenses a long text into a shorter one while keeping its main points
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    Raskolnikov had a terrible dream. He was a little boy, walking with his father through a desolate village.
    The sky was low and grey, and everything was silent. Suddenly, a drunken group of peasants appeared, laughing
    and shouting. They had a cart with an old, broken-down mare hitched to it. Laughing cruelly, they tried to
    force the exhausted animal to run. They whipped it, beat it with crowbars, screamed at it as it staggered and fell.
    The boy sobbed and begged them to stop, but they only laughed louder. The horse was beaten to death, its eyes
    wide with terror and blood on its flanks, while the little boy screamed and cried, powerless. Then he woke up,
    drenched in sweat, heart pounding, the scene burned into his mind.

    This dream haunted him. It was more than just a nightmare; it was a symbol, something deeper. He paced the floor,
    wrestling with the images and the weight of what he had done. Murder was not the clean, rational act he had imagined.
    It was chaos. It left behind blood, terror, and voices in the night. The idea that he was above morality,
    that he could commit a crime for a higher purpose — it now felt hollow. And yet, he could not let go of it.
    He clung to the belief that his theory had merit, even as guilt gnawed at the edges of his soul. His intellect
    and his conscience were at war, and he was the battlefield.
    """
)



No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' Raskolnikov had a terrible dream about beating a broken-down mare to death . He woke up, drenched in sweat, heart pounding, the scene burned into his mind . The idea that he was above morality, that he could commit a crime for a higher purpose — it now felt hollow .'}]

In [33]:
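# ➤ translation → translates text; the chosen model fixes the language pair
#     (Helsinki-NLP/opus-mt-fr-en = French → English)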
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

Device set to use cpu


[{'translation_text': 'This course is produced by Hugging Face.'}]