In [10]:
# Uncomment and run this cell if you're on Colab or Kaggle
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

Cloning into 'notebooks'...
remote: Enumerating objects: 530, done.
remote: Counting objects: 100% (210/210), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 530 (delta 184), reused 162 (delta 162), pack-reused 320 (from 2)
Receiving objects: 100% (530/530), 28.52 MiB | 23.52 MiB/s, done.
Resolving deltas: 100% (253/253), done.
/content/notebooks/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


This code sets up your Colab/Kaggle environment by downloading the course repo and installing all the necessary libraries.

In [11]:
#hide
from utils import *
setup_chapter()

No GPU was detected! This notebook can be *very* slow without a GPU 🐢
Go to Runtime > Change runtime type and select a GPU hardware accelerator.
Using transformers v4.16.2
Using datasets v1.16.1


Upgrade libraries

In [16]:
!pip install -U transformers huggingface_hub accelerate

Collecting transformers
  Downloading transformers-4.56.0-py3-none-any.whl.metadata (40 kB)
Collecting accelerate
  Downloading accelerate-1.10.1-py3-none-any.whl.metadata (19 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Downloading transformers-4.56.0-py3-none-any.whl (11.6 MB)
Downloading accelerate-1.10.1-py3-none-any.whl (374 kB)
Downloading tokenizers-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)

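The #hide cell imports the helper tools from utils and calls setup_chapter() to prepare the notebook environment for the chapter. The pip install -U cell then upgrades transformers, huggingface_hub, and accelerate to their latest versions.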
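The setup log reports transformers v4.16.2, while the upgrade cell downloads 4.56.0, so it can help to confirm which versions are actually installed. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the runtime may need a restart before already-imported packages pick up the upgrade:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the three packages upgraded by the pip cell above.
for name in ("transformers", "huggingface_hub", "accelerate"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```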

## A Tour of Transformer Applications

In [1]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

This code stores a long text (the complaint letter) into the variable text, so it can be used later (for NLP tasks like sentiment analysis, summarization, or translation).

### Text Classification

In [2]:
#hide_output
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

Device set to use cpu


from transformers import pipeline

* Imports the pipeline function from Hugging Face’s transformers library.

* This function makes it easy to use pre-trained models.

classifier = pipeline("text-classification")

* Creates a pipeline object for text classification (default = sentiment analysis).

* Downloads a pre-trained model if not already available.

* Saves it in the variable classifier, so you can now use it to analyze text.

In [3]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)

Unnamed: 0,label,score
0,NEGATIVE,0.901546


import pandas as pd

* Loads the pandas library (for working with tables and DataFrames).

outputs = classifier(text)

* Sends the variable text (your Amazon complaint letter) into the classifier pipeline.

* The classifier predicts the sentiment (or labels).

* The result is stored in outputs.

* outputs is a list of dictionaries, one per input text, e.g. [{'label': 'NEGATIVE', 'score': 0.901546}].

pd.DataFrame(outputs)

* Converts the list of dictionaries into a pandas DataFrame.

* This makes the results easier to read, like a table:

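Result
* NEGATIVE → The model classifies the text as negative.

* 0.90 → The model gives the NEGATIVE label a score of about 0.90, i.e. it is roughly 90% confident.

* This matches the input, which is a complaint letter.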
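The list-of-dictionaries structure can also be read without pandas. A small sketch that formats the prediction shown in the table above; the helper name `describe` is hypothetical, and the values are copied from the output rather than produced by a model here:

```python
# Classifier output for the complaint letter, copied from the table above.
outputs = [{"label": "NEGATIVE", "score": 0.901546}]


def describe(prediction):
    """Turn one {'label', 'score'} dictionary into a short readable string."""
    return f"{prediction['label']} ({prediction['score']:.1%})"


for prediction in outputs:
    print(describe(prediction))
```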

### Named Entity Recognition 🆘

Named entities = real-world objects in text (e.g., people, places, products, organizations).

NER = finding and labeling these entities in text.

In [4]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

Device set to use cpu


Unnamed: 0,entity_group,score,word,start,end
0,ORG,0.87901,Amazon,5,11
1,MISC,0.990859,Optimus Prime,36,49
2,LOC,0.999755,Germany,90,97
3,MISC,0.55657,Mega,208,212
4,PER,0.590256,##tron,212,216
5,ORG,0.669692,Decept,253,259
6,MISC,0.498349,##icons,259,264
7,MISC,0.775362,Megatron,350,358
8,MISC,0.987854,Optimus Prime,367,380
9,PER,0.812096,Bumblebee,502,511


pipeline("ner", aggregation_strategy="simple")

* Loads a pre-trained NER model.

* aggregation_strategy="simple" → combines tokens that belong to the same entity (so “New York” = one entity, not two words).

outputs = ner_tagger(text)

* Runs NER on your Amazon complaint letter.

* Finds entities like Amazon, Optimus Prime, Germany, Megatron, Bumblebee.

pd.DataFrame(outputs)

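* Shows the results in a table with the columns entity_group (the label), score (the confidence), word, and start and end (the character offsets of the entity in the text).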

Amazon → recognized as an organization (ORG).

Optimus Prime → recognized as miscellaneous (MISC) (a product/character).

Germany → recognized as a location (LOC).

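Bumblebee → recognized as a person (PER).

Megatron → recognized as MISC at its second mention, but its first mention is split into the word pieces Mega (MISC) and ##tron (PER).

Decepticons → split into the word pieces Decept (ORG) and ##icons (MISC). The low scores on these pieces (about 0.5 to 0.67) show the model is unsure about words it rarely saw in training.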

**In short: The model found companies, products/characters, places, and people in the letter.**
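The split word pieces in the table (Mega / ##tron, Decept / ##icons) can be glued back together in post-processing. A rough sketch, not the pipeline's own logic: the helper `merge_wordpieces` is hypothetical, and letting the merged span take the label of its higher-scoring piece is an assumption made only for illustration.

```python
# Rows copied from the NER output table above.
entities = [
    {"entity_group": "MISC", "score": 0.55657, "word": "Mega", "start": 208, "end": 212},
    {"entity_group": "PER", "score": 0.590256, "word": "##tron", "start": 212, "end": 216},
    {"entity_group": "ORG", "score": 0.669692, "word": "Decept", "start": 253, "end": 259},
    {"entity_group": "MISC", "score": 0.498349, "word": "##icons", "start": 259, "end": 264},
    {"entity_group": "PER", "score": 0.812096, "word": "Bumblebee", "start": 502, "end": 511},
]


def merge_wordpieces(entities):
    """Glue '##' continuation pieces onto the directly preceding span."""
    merged = []
    for ent in entities:
        is_continuation = (
            bool(merged)
            and ent["word"].startswith("##")
            and ent["start"] == merged[-1]["end"]
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += ent["word"][2:]  # drop the '##' marker
            prev["end"] = ent["end"]
            # Assumption: keep the label and score of the higher-scoring piece.
            if ent["score"] > prev["score"]:
                prev["entity_group"] = ent["entity_group"]
                prev["score"] = ent["score"]
        else:
            merged.append(dict(ent))  # copy so the input rows stay untouched
    return merged


for ent in merge_wordpieces(entities):
    print(ent["word"], ent["entity_group"], ent["start"], ent["end"])
```

With this rule the two fragments of each name become one row, although the resulting label is only as reliable as the low-confidence pieces it came from.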

### Question Answering

In [6]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

Downloading:   0%|          | 0.00/473 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/249M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

Unnamed: 0,score,start,end,answer
0,0.631292,335,358,an exchange of Megatron


pipeline("question-answering")

* Loads a pre-trained Q&A model.

question = "What does the customer want?"

* Sets the question you want the model to answer.

reader(question=question, context=text)

* Gives the model both the question and the context (the Amazon complaint letter).

* The model searches the text and finds the best answer.

pd.DataFrame([outputs])

* Displays the answer in a table with columns like answer, score, start, end.


It also returns the character indices (start and end positions of the answer in the passage).

This is called **extractive QA** because the answer is taken directly from the text, not generated.
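What "extractive" means can be shown with plain string slicing: the answer is exactly the substring of the context between start and end. A small sketch using one sentence of the letter as a stand-in context, so the offsets differ from the 335 and 358 reported for the full letter:

```python
# One sentence from the letter, used as a small stand-in context.
context = (
    "To resolve the issue, I demand an exchange of Megatron "
    "for the Optimus Prime figure I ordered."
)
answer = "an exchange of Megatron"  # answer string from the output above

# Offsets are relative to this one sentence, not to the full letter.
start = context.find(answer)
end = start + len(answer)

# Slicing the context with the offsets recovers the answer verbatim.
print(start, end, repr(context[start:end]))
```

The span length here is 23 characters, which agrees with end - start = 358 - 335 in the pipeline output.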

### Summarization 🆘

Task: Take a long text as input and create a short version that keeps the most important facts.

This is harder than earlier tasks (like classification or QA).

Why? Because the model must generate new, coherent sentences, not just extract words from the text.

In [5]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

Device set to use cpu
Your min_length=56 must be inferior than your max_length=45.


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.


pipeline("summarization")

* Loads a pre-trained summarization model.

summarizer(text, ...)

* Runs the summarizer on your text (the Amazon complaint).

* Parameters:

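  * max_length=45 → The summary can be at most 45 tokens long. This is below the model's default min_length of 56, which is why the log shows the warning "Your min_length=56 must be inferior than your max_length=45".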

  * clean_up_tokenization_spaces=True → Removes extra spaces from the output so the text looks clean.

print(outputs[0]['summary_text'])

* Prints only the generated summary text.
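The min_length warning in the log can be avoided by passing a min_length that is smaller than max_length. A hedged sketch of one way to pick consistent values: the helper `summary_length_kwargs` is hypothetical, 56 is the default reported in the warning above, and halving max_length is an arbitrary illustrative choice.

```python
def summary_length_kwargs(max_length, default_min_length=56):
    """Return min/max length arguments with min_length kept below max_length."""
    min_length = default_min_length
    if min_length >= max_length:
        # Arbitrary fallback for illustration: use half of max_length.
        min_length = max_length // 2
    return {"min_length": min_length, "max_length": max_length}


kwargs = summary_length_kwargs(max_length=45)
print(kwargs)

# Hypothetical usage with the pipeline from the cell above:
# outputs = summarizer(text, clean_up_tokenization_spaces=True, **kwargs)
```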

### Translation

In [None]:
translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus
Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete,
entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von
Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich
hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere
einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt.
Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, bald von
Ihnen zu hören. Aufrichtig, Bumblebee.


pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

* Loads a translation pipeline.

* Uses the model Helsinki-NLP/opus-mt-en-de (English → German).

translator(text, ...)

* Runs the translator on your variable text (the Amazon complaint letter).

* Parameters:

  * clean_up_tokenization_spaces=True → removes extra spaces in the output.

  * min_length=100 → makes sure the translation is at least 100 tokens long.

print(outputs[0]['translation_text'])

* Prints the actual translated text (English → German).

### Text Generation

In [None]:
#hide
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

#hide

* Special comment that hides the cell from the rendered notebook or book (the cell still runs in the background).

from transformers import set_seed

* Imports the set_seed function from Hugging Face’s Transformers library.

set_seed(42)

* Fixes the random seed to 42.

* This makes results reproducible → you (and others) get the same outputs every time you run the model, instead of slightly different ones.
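The effect of a fixed seed can be illustrated with Python's built-in random module. This is only an analogy: set_seed applies the same idea to the random number generators used by the model libraries, and the helper name `sample` is hypothetical.

```python
import random


def sample(seed, n=3):
    """Seed Python's generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = sample(42)
second = sample(42)
print(first == second)  # identical seeds give identical draws
```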

In [None]:
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

Dear Amazon, last week I ordered an Optimus Prime action figure from your online
store in Germany. Unfortunately, when I opened the package, I discovered to my
horror that I had been sent an action figure of Megatron instead! As a lifelong
enemy of the Decepticons, I hope you can understand my dilemma. To resolve the
issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered.
Enclosed are copies of my records concerning this purchase. I expect to hear
from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. The order was
completely mislabeled, which is very common in our online store, but I can
appreciate it because it was my understanding from this site and our customer
service of the previous day that your order was not made correct in our mind and
that we are in a process of resolving this matter. We can assure you that your
order


pipeline("text-generation")

* Loads a pre-trained text generation model (default = GPT-2).

response = ...

* Creates a short customer service reply.

prompt = text + ...

* Combines the original complaint letter (text) with a new section: the header "Customer service response:" followed by the start of the reply (response).

generator(prompt, max_length=200)

* Feeds the combined text into the model.

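* max_length=200 caps the total length at 200 tokens, counting the prompt as well as the new text, so the model can only add as many tokens as the prompt leaves free. This is why the output stops mid-sentence.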

print(outputs[0]['generated_text'])

* Prints the entire generated output (complaint letter + AI-generated response).
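Because the generated text begins with the prompt, the model's continuation can be separated by slicing the prompt off. A minimal sketch with stand-in strings: `text` is shortened here, and `generated_text` imitates outputs[0]['generated_text'] using the first words of the continuation shown above instead of calling a model.

```python
# Shortened stand-in for the complaint letter (the real `text` is much longer).
text = "Dear Amazon, ... Sincerely, Bumblebee."
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response

# Stand-in for outputs[0]['generated_text']: the prompt followed by the
# first words of the continuation shown in the output above.
generated_text = prompt + " The order was completely mislabeled"

# The generated text starts with the prompt, so slicing leaves only the new part.
continuation = generated_text[len(prompt):]
print(repr(continuation))
```

If the goal is to bound only the newly generated part, the max_new_tokens argument can be passed instead of max_length.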

### Image generation

Stay on CPU (remove fp16)

In [None]:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float32,
).to("cpu")

pipe.enable_attention_slicing()

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.show()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

safety_checker/model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

unet/diffusion_pytorch_model.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

vae/diffusion_pytorch_model.safetensors:   0%|          | 0.00/335M [00:00<?, ?B/s]

text_encoder/model.safetensors:   0%|          | 0.00/492M [00:00<?, ?B/s]

Run on GPU (recommended)

In [None]:
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,   # keep fp16
    variant="fp16",
).to(device)

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.show()

In [4]:
!pip install nbstripout
!nbstripout --install

Collecting nbstripout
  Downloading nbstripout-0.8.1-py2.py3-none-any.whl.metadata (19 kB)
Downloading nbstripout-0.8.1-py2.py3-none-any.whl (16 kB)
Installing collected packages: nbstripout
Successfully installed nbstripout-0.8.1
fatal: --local can only be used inside a git repository
Installation failed: not a git repository!
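The log above shows nbstripout --install failing because the working directory is not inside a git repository. A rough sketch of a check that could run first; the helper `inside_git_repo` is hypothetical and only looks for a .git entry in the directory and its parents, so it ignores less common setups such as a GIT_DIR environment variable.

```python
from pathlib import Path


def inside_git_repo(path="."):
    """Heuristic: True if `path` or any parent directory contains a .git entry."""
    current = Path(path).resolve()
    return any((folder / ".git").exists() for folder in (current, *current.parents))


# Check the current working directory before running `nbstripout --install`.
print(inside_git_repo("."))
```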
