# Notebook: Transformer-Based Models Part 2

*Prof. Dr. M. Kurpicz-Briki, Bachelor in Data Engineering, Bern University of Applied Sciences*

In [None]:
!pip install transformers torch --quiet

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Exercise 1: BERT-style masked language modeling




In [None]:
fill_mask = pipeline(
    "fill-mask",
    model="distilbert-base-uncased"
)

sentences = [
    "The movie was [MASK] but the acting saved it.",
    "Without coffee, I can't [MASK] in the morning.",
    "If it rains tomorrow, the [MASK] will be cancelled."
]

for s in sentences:
    print("\nSentence:", s)
    for pred in fill_mask(s):
        print(f"  {pred['sequence']}  (p={pred['score']:.3f})")



Sentence: The movie was [MASK] but the acting saved it.
  the movie was banned but the acting saved it.  (p=0.090)
  the movie was unsuccessful but the acting saved it.  (p=0.064)
  the movie was successful but the acting saved it.  (p=0.033)
  the movie was filmed but the acting saved it.  (p=0.030)
  the movie was lost but the acting saved it.  (p=0.028)

Sentence: Without coffee, I can't [MASK] in the morning.
  without coffee, i can ' t sleep in the morning.  (p=0.620)
  without coffee, i can ' t stay in the morning.  (p=0.042)
  without coffee, i can ' t eat in the morning.  (p=0.039)
  without coffee, i can ' t work in the morning.  (p=0.034)
  without coffee, i can ' t walk in the morning.  (p=0.018)

Sentence: If it rains tomorrow, the [MASK] will be cancelled.
  if it rains tomorrow, the tournament will be cancelled.  (p=0.108)
  if it rains tomorrow, the game will be cancelled.  (p=0.096)
  if it rains tomorrow, the race will be cancelled.  (p=0.076)
  if it rains tomorrow, 

## Exercises

Look at the top predictions for each [MASK].

Do they make sense given both left and right context?

Try moving [MASK] to a different position in the sentence (e.g., mask the last word instead of a middle word).

How do the predictions change?

Craft one sentence where the right context is crucial to disambiguate the masked word (e.g., "He put the turkey in the [MASK] and turned it to 200 degrees.").
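To make it easier to try different mask positions, the following is a minimal sketch of a helper that replaces one whitespace-separated word with the mask token. The name `mask_at` and its parameters are hypothetical and not part of the `transformers` library; the helper only builds strings and assumes the `[MASK]` token used by the model above.

```python
def mask_at(sentence, position, mask_token="[MASK]"):
    """Replace the whitespace-separated word at `position` with the mask token."""
    words = sentence.split()
    if not 0 <= position < len(words):
        raise IndexError(f"position {position} is out of range for {len(words)} words")
    word = words[position]
    # Keep trailing punctuation (e.g. the final period) outside the mask.
    stripped = word.rstrip(".,!?;:")
    trailing = word[len(stripped):]
    words[position] = mask_token + trailing
    return " ".join(words)


sentence = "If it rains tomorrow, the game will be cancelled."
variants = [mask_at(sentence, i) for i in (1, 5, 8)]
for v in variants:
    print(v)
```

Each resulting string can then be passed to `fill_mask` from the cell above to see how the predictions change with the amount of left and right context.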

# Exercise 2: GPT-style next-token prediction



In [None]:
# Simple generation (greedy decoding, because do_sample=False)

from transformers import pipeline

gpt_gen = pipeline(
    "text-generation",
    model="distilgpt2"
)

prompt = "The movie was"
outputs = gpt_gen(prompt, max_new_tokens=10, do_sample=False)
print(outputs[0]["generated_text"])


The movie was a hit in the United States and was a hit


## Exercises

Change the prompt to one of the following and compare the continuations:

"Without coffee, I can't"

"If it rains tomorrow, the"

Note that GPT continues only from the left context; it never "looks ahead".
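The "left context only" idea can be pictured as an attention mask. The sketch below is a simplified illustration, not the internals of `distilgpt2`: it treats whole words as tokens and uses a hypothetical helper `attention_mask` to show which positions each token may look at in a GPT-style (causal) versus a BERT-style (bidirectional) model.

```python
def attention_mask(n_tokens, causal):
    """Return an n x n matrix; entry [i][j] is 1 if position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


tokens = ["The", "movie", "was", "great"]
causal = attention_mask(len(tokens), causal=True)
bidirectional = attention_mask(len(tokens), causal=False)

for name, mask in [("GPT-style (causal)", causal), ("BERT-style (bidirectional)", bidirectional)]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>6s}  {row}")
```

In the causal matrix, every row has zeros to the right of the diagonal, so a position never sees the tokens that follow it. In the bidirectional matrix, every position sees the whole sentence.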

In [None]:
# Inspect top-k next tokens (architecture insight)

tokenizer_gpt = AutoTokenizer.from_pretrained("distilgpt2")
model_gpt = AutoModelForCausalLM.from_pretrained("distilgpt2")

def show_next_token_predictions(prompt, k=10):
    """Print the k most probable next tokens for `prompt`."""
    inputs = tokenizer_gpt(prompt, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model_gpt(**inputs)
    logits = outputs.logits  # [batch, seq_len, vocab_size]
    next_token_logits = logits[0, -1]  # scores for the token after the last prompt position
    probs = torch.softmax(next_token_logits, dim=-1)  # logits -> probabilities
    top_k = torch.topk(probs, k)

    print(f"\nPrompt: {prompt!r}")
    for idx, p in zip(top_k.indices, top_k.values):
        token_str = tokenizer_gpt.decode(idx.item())  # token ID -> text
        print(f"  {repr(token_str):>10s}  p={p.item():.3f}")

show_next_token_predictions("The movie was")
show_next_token_predictions("Without coffee, I can't")


Prompt: 'The movie was'
        ' a'  p=0.067
  ' released'  p=0.063
     ' made'  p=0.055
     ' shot'  p=0.054
      ' the'  p=0.027
  ' directed'  p=0.022
  ' originally'  p=0.021
    ' based'  p=0.020
   ' filmed'  p=0.016
  ' nominated'  p=0.016

Prompt: "Without coffee, I can't"
     ' wait'  p=0.089
  ' believe'  p=0.073
     ' help'  p=0.064
  ' remember'  p=0.038
    ' think'  p=0.035
  ' imagine'  p=0.034
      ' get'  p=0.034
     ' stop'  p=0.029
     ' even'  p=0.025
     ' tell'  p=0.021
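The function above turns the logits at the last position into probabilities with softmax and then keeps the k largest. The same two steps can be sketched in plain Python on a toy vocabulary. The tokens and logit values below are made up for illustration and do not come from `distilgpt2`; the helper names `softmax` and `top_k` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(vocab, logits, k):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_vocab = [" a", " released", " made", " banana"]
toy_logits = [2.0, 1.9, 1.8, -3.0]  # invented scores, one per vocabulary entry

for token, p in top_k(toy_vocab, toy_logits, k=3):
    print(f"{token!r:>12}  p={p:.3f}")
```

Note that close logits give close probabilities, which is why several plausible continuations in the real output above have similar scores.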


## Exercises
Compare GPT's top next tokens to BERT's predictions for [MASK] in similar contexts.

How do they differ qualitatively?

Why does GPT not know what comes after the next token when predicting it?

# Exercise 3: Synthesis question

Discuss the following question with your colleagues, or write down your answer in your own words:

Explain how the training objectives of BERT (masked language modeling, MLM) and GPT (autoregressive next-token prediction) lead to:

BERT being naturally better at "understand & classify / fill in a blank"

GPT being naturally better at "generate the next continuation of a partial sentence"
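As a starting point for the discussion, the two objectives can be compared by looking at the training examples each one produces from the same sentence. This is a simplified sketch with hypothetical helper names (`causal_lm_examples`, `masked_lm_example`); it uses whole words as tokens and masks a single chosen position, whereas real MLM training masks a random subset of subword tokens.

```python
def causal_lm_examples(tokens):
    """Autoregressive objective: predict each token from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, position, mask_token="[MASK]"):
    """Masked objective: hide one token and predict it from both sides."""
    masked = tokens[:position] + [mask_token] + tokens[position + 1:]
    return masked, tokens[position]


tokens = ["the", "game", "will", "be", "cancelled"]

print("GPT-style (context -> target):")
for context, target in causal_lm_examples(tokens):
    print(f"  {' '.join(context)!r} -> {target!r}")

masked_input, masked_target = masked_lm_example(tokens, position=1)
print("BERT-style (masked input -> target):")
print(f"  {' '.join(masked_input)!r} -> {masked_target!r}")
```

The GPT-style contexts always stop before the target, while the BERT-style input keeps the words on both sides of the hidden token.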