<a href="https://colab.research.google.com/github/BKBKlaassen/Gr8_ModelsForLanguageProcessing_assignments/blob/main/Group_8_Assignment_4_2026_student_b_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment 4

In this assignment, you are asked to work with pretrained language models.

# <font color="red">Contributions</font>

Group number: 8

Group members: Bjorn Klaassen, Noah de Jonge

Who contributed to which exercises (you don't need to be very detailed):

The pretrained language models used in this assignment are small by modern standards, but the computation required in the exercises below can still be sped up if you use a GPU. In the menu on Colab, go to "Runtime > Change runtime type" to enable a GPU device (designated as `'cuda'` in the code).

Google Colab offers limited GPU resources, which are sufficient for doing the entire assignment. However, if you debug your code by repeatedly running large amounts of GPU computation, you may run out of the GPU allocation provided by Google. One thing you should learn from this assignment is to use GPU compute strategically, debugging without a GPU or with reduced data or computation. (GPU resources are just as scarce in the 'real world', outside of the simple tasks we do in class: for example, there are open weight language models with advanced capabilities, such as DeepSeek-R1, that theoretically anyone can run, but in practice the hardware requirements are prohibitive.)

If you do run out of GPU allocation on Colab, you can usually continue working in a different Google account as a workaround.

To start the assignment, import prerequisite packages:

In [None]:
import torch
import numpy as np
from tqdm.notebook import trange, tqdm
import nltk,sklearn
from sklearn.model_selection import train_test_split

In [None]:
import pandas as pd
import collections, itertools
import more_itertools

# 4.1 BERT-like Model for Classification

## 4.1.1 SICK dataset

In Assignment 2, we did entailment (hypernymy) classification on the basis of word vectors. Now we can do a similar experiment for sentence embeddings. Start by downloading the SICK dataset:

In [None]:
!wget https://zenodo.org/record/2787612/files/SICK.zip?download=1 -O SICK.zip
!unzip SICK.zip
!rm SICK.zip

In [None]:
!head SICK.txt

In [None]:
import pandas as pd
sick_df = pd.read_csv('SICK.txt', sep='\t')

You can inspect the first few data entries of the SICK dataset. You will see `sentence_A`, `sentence_B`, and the `entailment_label` column, which indicates whether sentence A entails sentence B.

In [None]:
sick_df

Read the train and test data from the file. Be sure to include in the training data all sentence pairs marked as "train" or "trial" in the SICK.txt file, and in the test data all sentence pairs marked as "test". As labels, use values from the `entailment_label` column in the dataset.

In [None]:
sick_train_examples=[]
sick_test_examples=[]
sick_train_labels=[]
sick_test_labels=[]

for i in range(len(sick_df)):
  if(sick_df.SemEval_set[i] == "TRAIN" or sick_df.SemEval_set[i] == "TRIAL"):
    sick_train_examples.append((sick_df.sentence_A[i],sick_df.sentence_B[i]))
    sick_train_labels.append(sick_df.entailment_label[i])
  elif(sick_df.SemEval_set[i] == "TEST"):
    sick_test_examples.append((sick_df.sentence_A[i],sick_df.sentence_B[i]))
    sick_test_labels.append(sick_df.entailment_label[i])
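
The same split can also be expressed with boolean masks instead of an explicit loop. The sketch below is one possible formulation, not the required solution: `toy_df` is a hand-made frame that only mimics the SICK column names, and `split_pairs` is a hypothetical helper introduced for illustration.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the relevant SICK columns
toy_df = pd.DataFrame({
    "sentence_A": ["a1", "a2", "a3", "a4"],
    "sentence_B": ["b1", "b2", "b3", "b4"],
    "entailment_label": ["NEUTRAL", "ENTAILMENT", "CONTRADICTION", "NEUTRAL"],
    "SemEval_set": ["TRAIN", "TRIAL", "TEST", "TEST"],
})

def split_pairs(df, partitions):
    """Return (sentence pairs, labels) for rows whose SemEval_set is in partitions."""
    subset = df[df["SemEval_set"].isin(partitions)]
    pairs = list(zip(subset["sentence_A"], subset["sentence_B"]))
    labels = subset["entailment_label"].tolist()
    return pairs, labels

toy_train_pairs, toy_train_labels = split_pairs(toy_df, ["TRAIN", "TRIAL"])
toy_test_pairs, toy_test_labels = split_pairs(toy_df, ["TEST"])
print(len(toy_train_pairs), len(toy_test_pairs))
```

Applied to the real data, the same helper would be called on `sick_df` in place of `toy_df`.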


Check how many examples and labels you have got in each partition:

In [None]:
print(len(sick_train_examples),len(sick_train_labels),len(sick_test_examples),len(sick_test_labels))

## 4.1.2 Use a pretrained language model for sequence embedding

Here we can rely on Hugging Face, which provides many pretrained models in its ```transformers``` library.

In [None]:
!pip install transformers

Here we can use a relatively small BERT-like model called DistilBERT.

In [None]:
import transformers

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased",padding=True)

In [None]:
distilbert = transformers.AutoModel.from_pretrained("distilbert-base-uncased")

The model's tokenizer does a lot of heavy lifting. It takes in a sentence or a pair of sentences, concatenates them, tokenizes them, adds the [CLS] token (id: 101) and the [SEP] separator token(s) (id: 102), and returns a tensor with the list of token ids:

In [None]:
inputs = tokenizer(sick_df.sentence_A.tolist()[5],sick_df.sentence_B.tolist()[5], return_tensors="pt")

In [None]:
print(inputs)

How many tokens does the tokenizer recognise in sentences A and B in the example above?

__Answer:__

Not including the CLS token,

Sentence A: 13 tokens

Sentence B: 19 tokens

Together: 32 tokens

Together + CLS token: 33 Tokens
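
Counts like these can also be checked programmatically by splitting the id sequence at the [SEP] token. The following is a minimal sketch: `count_segment_tokens` is a hypothetical helper, and `example_ids` is a made-up id list rather than real tokenizer output. It assumes the sequence starts with [CLS] and that every segment ends with [SEP] (id 102, as stated above).

```python
def count_segment_tokens(input_ids, sep_id=102):
    """Count tokens in each [SEP]-terminated segment, excluding [CLS] and [SEP]."""
    counts = []
    current = 0
    for token_id in input_ids[1:]:  # skip the leading [CLS]
        if token_id == sep_id:
            counts.append(current)  # a segment just ended
            current = 0
        else:
            current += 1
    return counts

# Made-up ids with the layout [CLS] x x x [SEP] y y [SEP]
example_ids = [101, 2001, 2002, 2003, 102, 3001, 3002, 102]
print(count_segment_tokens(example_ids))  # [3, 2]
```

For the real example, the helper could be called on `inputs['input_ids'][0].tolist()`.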

The token IDs can be decoded back into strings, for example:

In [None]:
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for id, token in zip(inputs['input_ids'][0], tokens):
    print(f"{id.item():5d} → {token}")

Now we can pass the tensor output of the tokenizer through the model, getting its hidden states:

In [None]:
with torch.no_grad():

    output = distilbert(**inputs)

Above, ```torch.no_grad()``` guarantees that no gradients are computed in this code block. So the model's weights cannot be updated. This is what we want now: obtain sentence pair vectors from the model, without making any changes to the model itself.
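
The effect of ```torch.no_grad()``` can be seen on a small stand-in model. In this sketch a single linear layer plays the role of the pretrained model (an assumption made purely for illustration); the output computed inside the block carries no gradient information, while the same computation outside the block does.

```python
import torch

layer = torch.nn.Linear(4, 2)  # stand-in for a pretrained model
x = torch.randn(3, 4)

with torch.no_grad():
    frozen_out = layer(x)      # computed without building a gradient graph

tracked_out = layer(x)         # computed with gradient tracking

print(frozen_out.requires_grad)   # False
print(tracked_out.requires_grad)  # True
```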

You should use ```output.last_hidden_state``` that stores the last hidden layer. From that, you only need the first vector, which corresponds to the [CLS] token. It is the first in the sequence so has index 0.

**Exercise**. What is the size of that vector? Check!

In [None]:
#your code
#index [0, 0] selects the first sequence in the batch and its first token, [CLS]
print(output.last_hidden_state[0, 0].shape)

Now we can define a linear regression model in PyTorch:

In [None]:
class Regression(torch.nn.Module):
  def __init__(self, input_dim,output_dim):
    super(Regression,self).__init__()
    self.linear = torch.nn.Linear(input_dim,output_dim)

  def forward(self, x):
    outputs = self.linear(x)
    return outputs

It is the last-layer vector of the [CLS] token that is normally used for sequence classification with BERT models. Train and test a logistic regression classifier on top of (frozen) DistilBERT embeddings for sentence pairs in SICK:

In [None]:
distilbert_embdim = distilbert.config.hidden_size

entailment_model = Regression(distilbert_embdim,3)

**Exercise**. Define loss and initialize an optimizer:

In [None]:
#your code here
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(entailment_model.parameters(), lr=0.01)
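
A linear layer trained with ```CrossEntropyLoss``` behaves as a multiclass logistic regression, because the loss applies a log-softmax to the raw outputs before taking the negative log-likelihood. The sketch below checks this equivalence on random toy logits; the tensors are made up for illustration and are not SICK data.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, 3)               # 5 examples, 3 classes
targets = torch.tensor([0, 2, 1, 1, 0])  # made-up class indices

# Built-in loss on raw logits
ce = torch.nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, then mean negative log-probability of the target class
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(5), targets].mean()

print(torch.allclose(ce, manual))
```

This is why the model class itself does not need a softmax layer: the loss expects unnormalised scores.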


Now prepare the data: convert ```sick_train_examples``` and ```sick_test_examples``` into lists of torch tensors (embeddings of the sentence pairs).

__Hint__: to speed up neural computation, it can be helpful to move data and model to the GPU, a special processor that handles numeric tensor operations more efficiently (you may use commands like ```model.to('cuda')```, ```tensordata.to('cuda')```). For further non-neural computation, data can be moved back to the CPU ```tensordata.to('cpu')```. In case you (temporarily) don't have access to a GPU, you can use `'cpu'` instead of `'cuda'` to do the computation on a CPU. However, this will slow the computation down.
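
The device handling described in the hint can be sketched as follows. The names `toy_model`, `toy_batch`, and `toy_scores` are hypothetical stand-ins; the pattern selects `'cuda'` when a GPU is available and falls back to `'cpu'` otherwise, so the same cell runs in both settings.

```python
import torch

# Pick the GPU when available, otherwise the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

toy_model = torch.nn.Linear(8, 3).to(device)  # move the model weights
toy_batch = torch.randn(2, 8).to(device)      # move the input data

with torch.no_grad():
    # Compute on the chosen device, then bring the result back to the CPU
    toy_scores = toy_model(toy_batch).to('cpu')

print(device, toy_scores.shape)
```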

Processing all the sentence pairs in pretrained DistilBERT will take a couple of minutes.

In [None]:
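#enabling GPU computation (falls back to the CPU when no GPU is available)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
distilbert.to(device).eval()

def embed_pair(pair):
  #last-layer [CLS] vector for one sentence pair, computed without gradients
  with torch.no_grad():
    return distilbert(**tokenizer(pair[0], pair[1], return_tensors="pt").to(device)).last_hidden_state[0, 0].cpu()

sick_train_examples = [embed_pair(pair) for pair in sick_train_examples]
sick_test_examples = [embed_pair(pair) for pair in sick_test_examples]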


Similarly, convert ```sick_train_labels``` and ```sick_test_labels``` into lists of indices:

In [None]:
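#map each entailment label string to a class index
label2idx = {label: idx for idx, label in enumerate(sorted(set(sick_train_labels)))}
sick_train_labels_idx = [label2idx[label] for label in sick_train_labels]
sick_test_labels_idx = [label2idx[label] for label in sick_test_labels]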

**Exercise**. Train for 40 epochs and test a regression model on DistilBERT embeddings to classify SICK with the three entailment labels. Print out train and test accuracy and loss every 5 epochs.

In [None]:
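epochs=40
#accuracy
def accuracy(predictions, scores):
  score = 0
  for i in range(0,len(predictions)):
    if predictions[i] == scores[i]:
      score +=1
  return score / len(scores)

#assumes the example lists hold [CLS] embedding tensors and the *_labels_idx lists hold class indices, as described above
train_x, train_y = torch.stack(sick_train_examples), torch.tensor(sick_train_labels_idx)
test_x, test_y = torch.stack(sick_test_examples), torch.tensor(sick_test_labels_idx)

# training model
for epoch in trange(epochs):
  entailment_model.train()
  for start in range(0, len(train_x), 32):  #mini-batches of 32 examples
    optimizer.zero_grad()
    loss = loss_function(entailment_model(train_x[start:start+32]), train_y[start:start+32])
    loss.backward()
    optimizer.step()

  if(epoch+1) % 5 ==0:
    with torch.no_grad():
      entailment_model.eval()
      for name, x, y in (("train", train_x, train_y), ("test", test_x, test_y)):
        logits = entailment_model(x)
        print(f"epoch {epoch+1} {name}: loss {loss_function(logits, y).item():.4f}, accuracy {accuracy(logits.argmax(dim=1), y):.4f}")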


## 4.1.3 Fine-tuning

Can you improve the performance of entailment classification even further?

_Fine-tune_ DistilBERT on the task: train and test a logistic regression classifier on top of DistilBERT embeddings for sentence pairs in SICK while updating the model weights. Define a new model class that combines DistilBERT and regression models into a single model, which passes sentence pair input through DistilBERT and uses the output embedding of the CLS token as input to regression, which finally produces the output of the whole combined model.

The weights of this bigger model (i.e. both DistilBERT weights and regression weights) can then be updated by the optimizer.
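
One way to structure such a combined model is sketched below. This is an illustration under stated assumptions, not the reference solution: `EncoderWithHead` is a hypothetical class name, and `ToyEncoder` is a tiny made-up stand-in that only imitates the `last_hidden_state` output of DistilBERT so that the sketch runs without downloading a model.

```python
import torch
from types import SimpleNamespace

class EncoderWithHead(torch.nn.Module):
    """Encoder followed by a linear classifier on the first ([CLS]) position."""
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # logits computed from the first token's vector

class ToyEncoder(torch.nn.Module):
    """Hypothetical stand-in for DistilBERT: an embedding table and nothing else."""
    def __init__(self, vocab_size=50, hidden_size=8):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))

toy = EncoderWithHead(ToyEncoder(), hidden_size=8, num_labels=3)
toy_ids = torch.randint(0, 50, (4, 6))  # 4 made-up sequences of 6 token ids
toy_logits = toy(toy_ids)
toy_loss = torch.nn.CrossEntropyLoss()(toy_logits, torch.tensor([0, 1, 2, 0]))
toy_loss.backward()                     # gradients reach encoder and head alike
print(toy_logits.shape)
```

Because the encoder is a submodule, `toy.parameters()` includes its weights, so an optimizer built from them updates both parts. With the real model, the constructor would presumably receive `distilbert` and `distilbert.config.hidden_size`.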

**Exercise**. Train the whole pipeline for 4 epochs using the train/test split as above.

For this last exercise, using a GPU is essential.

In [None]:
epochs = 4
#your code


# 4.2 Autoregressive Transformer (GPT2)

## 4.2.1 Text generation with GPT2


We will load and use the smallest version of GPT2 to save time and resources; it suffices for our purposes.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

To put GPT-2 to the test, we can create a list of text snippets that GPT-2 is then asked to complete. We use three simple prompts here, taken from the beginning of NL Times articles in 2024.

In [None]:
prompts=[
    "The Schoof I Cabinet will likely have fewer Ministers and State Secretaries than the outgoing Rutte IV Cabinet, sources close to the formation process told the Telegraaf.",
    "Outgoing Agriculture Minister Piet Adema wants to ban business of yoga sessions that also involve puppies. Puppy yoga has been offered in Amsterdam for months now.",
    "The Walt Disney Company will participate in the Pride boat parade in the Netherlands for the first time on Saturday. "]

**Exercise**. Define functions ```run_on_prompt``` and ```run_on_prompts```. ```run_on_prompt``` takes a string prompt and returns a list of n continuations of the prompt. ```run_on_prompts``` takes a list of prompts and returns a list of lists of n continuations of each prompt.
Schematically, `run_on_prompts([prompt1,prompt2],nsamples=2)` should output `[[continuation1OfPrompt1,continuation2OfPrompt1],[continuation1OfPrompt2,continuation2OfPrompt2]]`.

*  Make sure your functions contain the `temperature` and `top_p` sampling parameters as you will be asked to experiment with them.
*  Your ```run_on_prompt``` function should also print the prompt and generated text continuations so you can inspect them.
*  Make sure that the outputs produced by ```run_on_prompt``` do not contain the prompt itself.

**Hints:**
* You'll need to use both the tokenizer and the model's generate() method. Look up the necessary details in the Hugging Face text [generation documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation) and the [reference on generation strategies](https://huggingface.co/docs/transformers/generation_strategies).
* To extract just the continuation, compare the length of the original prompt to the generated text.

In [None]:
# Set padding token
tokenizer.pad_token = tokenizer.eos_token



def run_on_prompt(model, prompt, nsamples=3, length=1,
                   temperature=1,
                   top_k=None,
                   top_p=1):
    """
    Generates text continuations for a single prompt using a GPT model.

    Args:
        model: GPT model instance.
        prompt: A string prompt.
        nsamples: The number of samples to generate for the prompt.
        length: The maximum length of new tokens to generate.
        temperature: Temperature parameter for controlling randomness.
        top_k: Top-k sampling parameter.
        top_p: Top-p (nucleus) sampling parameter.

    Returns:
        A list of generated text continuations (without the prompt).
    """
    model.eval()
    device = next(model.parameters()).device
    input = tokenizer.encode(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
      output = model.generate(input, max_new_tokens=length, do_sample=True, temperature=temperature, top_k=top_k if top_k is not None else 0, top_p=top_p, num_return_sequences=nsamples, pad_token_id=tokenizer.eos_token_id)
    generated = []
    for o in output:
      generated.append(tokenizer.decode(o[input.shape[-1]:], skip_special_tokens=True))
    print(f"prompt: {prompt}\n")
    for g in generated:
      print(f"generated: {g}\n")
    return generated


def run_on_prompts(model, prompt_list, nsamples=3, length=1,
                   temperature=1,
                   top_k=None,
                   top_p=1):
    """
    Generates text continuations for a list of prompts using a GPT model.

    Args:
        model: GPT model instance.
        prompt_list: A list of strings, where each string is a prompt.
        nsamples: The number of samples to generate for each prompt.
        length: The maximum length of new tokens to generate.
        temperature: Temperature parameter for controlling randomness.
        top_k: Top-k sampling parameter.
        top_p: Top-p (nucleus) sampling parameter.

    Returns:
        A list of lists, where each inner list contains continuations for one prompt.
    """
    output = []
    for p in prompt_list:
      output.append(run_on_prompt(model, p, nsamples, length, temperature, top_k, top_p))
    return output

Run GPT-2 with your function with different temperature values: 0.4 (low), 0.8,  1 (default) and 2 (high). Extreme values are suggested for pedagogical purposes.

In [None]:
# Move to GPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
gptsmall = model.to(device)
temp04outputs=run_on_prompts(gptsmall,prompts, length=100, temperature=0.4)

In [None]:
temp08outputs=run_on_prompts(gptsmall,prompts, length=100, temperature=0.8)

In [None]:
temp1outputs=run_on_prompts(gptsmall,prompts, length=100, temperature=1)

In [None]:
temp2outputs=run_on_prompts(gptsmall,prompts, length=100, temperature=2.0)

Inspect the generated texts. What are your observations about the role of temperature?

**Answer**:
At lower temperatures, the information the model gives may be neither correct nor logically sound, but the generated text is at least readable and consists mostly of well-formed sentences related to the prompt. At higher temperatures, the model starts "rambling" without any visible structure, using words that bear little relation to the prompt or to the other words in the sentence.
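
The mechanism behind this behaviour can be illustrated on a toy next-token distribution. In the sketch below, `toy_logits` is a made-up score vector over four imaginary tokens, and `next_token_distribution` and `entropy` are hypothetical helpers. Temperature divides the logits before the softmax, so low values sharpen the distribution and high values flatten it.

```python
import torch

def next_token_distribution(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature."""
    return torch.softmax(logits / temperature, dim=-1)

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -(probs * torch.log(probs)).sum().item()

toy_logits = torch.tensor([2.0, 1.0, 0.5, 0.0])  # made-up scores for 4 tokens
temperatures = (0.4, 0.8, 1.0, 2.0)
entropies = [entropy(next_token_distribution(toy_logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    top = next_token_distribution(toy_logits, t).max().item()
    print(f"T={t}: top probability {top:.3f}, entropy {h:.3f}")
```

The entropy grows with the temperature: at 0.4 most of the probability mass sits on the top token, while at 2.0 unlikely tokens are sampled far more often.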

Now keep the temperature at default and run text generation with nucleus sampling (p-sampling) with p at 0.2, 0.3, 0.5, and 0.95.

In [None]:
outputsp2=run_on_prompts(gptsmall,prompts, length=100,top_p=0.2)

In [None]:
outputsp3=run_on_prompts(gptsmall,prompts, length=100,top_p=0.3)

In [None]:
outputsp5=run_on_prompts(gptsmall,prompts, length=100,top_p=0.5)

In [None]:
outputsp95=run_on_prompts(gptsmall,prompts, length=100,top_p=0.95)

What are your observations on the outputs at different values of p?

**Answer**
At low p values, the model is rather unoriginal: it seems to first generate a safe sentence that is very similar to the prompt and then repeat it. At the other extreme, it is quite creative, never sticking to the same subject and involving many made-up people, while still producing relatively grammatical sentences.
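
Nucleus sampling can be illustrated on a toy distribution as well. The sketch below is a simplified illustration, not the library's implementation: `top_p_filter` is a hypothetical helper and `toy_probs` is a made-up distribution over four imaginary tokens. It keeps the most probable tokens until their cumulative mass reaches p, then renormalises.

```python
import torch

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the mass before it is still below p (the top token is always kept)
    keep = (cumulative - sorted_probs) < p
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

toy_probs = torch.tensor([0.15, 0.5, 0.05, 0.3])  # made-up distribution
for p in (0.2, 0.6, 0.9):
    print(p, top_p_filter(toy_probs, p).tolist())
```

With a small p only the single most likely token survives, which is consistent with the repetitive outputs observed above; larger p values leave more candidates to sample from.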

## 4.2.2 Automatic measuring of text diversity

You made some qualitative observations on the output. Can we also identify quantitative differences between the generated texts? One approach is to measure the fraction of unique fragments (e.g. words or word sequences) in text.

Define a function `uniqN` that takes a string `ss` and returns the proportion of unique token n-grams of length `n`. For simplicity, separate the string into words using `split()`.

In [None]:
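def uniqN(ss,n):
  #proportion of distinct word n-grams among all word n-grams in the string
  words = ss.split()
  ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
  if not ngrams:
    return 0.0
  return len(set(ngrams)) / len(ngrams)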
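
A small worked example makes the measure concrete. The helper `distinct_ngram_ratio` below is a hypothetical, compact restatement of the n-gram ratio described above, and the two strings are made up for illustration.

```python
def distinct_ngram_ratio(text, n):
    """Fraction of distinct word n-grams among all word n-grams in text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

repetitive = "the cat sat the cat sat"
varied = "the cat sat on a mat"

print(distinct_ngram_ratio(repetitive, 1))  # 3 distinct words out of 6 -> 0.5
print(distinct_ngram_ratio(repetitive, 2))  # 3 distinct bigrams out of 5 -> 0.6
print(distinct_ngram_ratio(varied, 1))      # all 6 words differ -> 1.0
```

Repeated phrases lower the ratio, so a text that loops on the same sentence scores well below a text that never repeats itself.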

Now define a function `avgrepN` which measures the diversity of text in an input corpus `c` (a list of lists of strings), returning the average `uniqN` value for each string in the corpus for n-grams ranging from length 1 to `maxn`. High values mean the texts are diverse, low values indicate a lot of repeating n-grams.

In [None]:
def avgrepN(c,maxn=3):
  unique = 0
  i = 0
  for n in range(maxn):
    for sc in c:
      for ss in sc:
        unique += uniqN(ss, n + 1)
        i += 1
  return unique / i

Calculate the average diversity for texts generated with different values of p under nucleus sampling.

In [None]:
print(avgrepN(outputsp2))
print(avgrepN(outputsp3))
print(avgrepN(outputsp5))
print(avgrepN(outputsp95))


Calculate the average diversity for texts generated with different values of temperature.

In [None]:
print(avgrepN(temp04outputs))
print(avgrepN(temp08outputs))
print(avgrepN(temp1outputs))
print(avgrepN(temp2outputs))

Now we can compare the GPT2-generated outputs with reference texts, i.e. continuations of the prompts in the actual texts.

In [None]:
REFERENCE1='''These types of lessons involve young dogs running around that the participants can cuddle after the lesson.
Adema said he does not believe it is healthy for the young animals. \"I don't think that is suitable. Puppies need to sleep. They are at a very early stage of their development,\" he stated after the regular weekly Cabinet meeting.
\"It serves no purpose at all and makes no sense. It really has to stop.\" He is preparing a draft proposal of the ban so that his replacement in the next Cabinet can implement it.
'''
REFERENCE2='''The Schoof I Cabinet will likely have fewer Ministers and State Secretaries than the outgoing Rutte IV Cabinet, sources close to the formation process told the Telegraaf. As the intended Prime Minister Dick Schoof is considered “party-less” and represents all four parties in the coalition, the Cabinet will likely have four Deputy Prime Ministers - one each from the PVV, VVD, NSC, and BBB.
Schoof was sparing with information after his first formation session. “It was a beautiful day,” he told the press after meeting with the leaders of the coalition parties and formateur Richard van Zwol.
'''

REFERENCE3='''The employees are taking part in the initiative of a company working group that promotes inclusion.
Disney has increasingly focused on inclusion in recent years. Pride Walks have already taken place in London, Berlin, and Paris. However, this is the first time that the company has taken such a clear stand and made such a visible statement during a Dutch Pride. "We are very pleased with the great interest and diversity of registrations for this year's boat parade," said the Utrecht Pride organization. \"The selected boats show what Utrecht Pride stands for, making the LGBTIQ+ community visible in all its facets, and we look forward to everyone enjoying this beautiful event, both on the water and along the side.\"'''

In [None]:
print(avgrepN([[REFERENCE1],[REFERENCE2],[REFERENCE3]]))

How do the reference texts compare to texts generated under different values of p or temperature?

**Answer**
In terms of diversity as well as "feel" when reading, the reference texts are most similar to outputs generated at a temperature of about 1 and a p of about 0.5. With lower values of either parameter, the output becomes too safe and unoriginal compared to the reference; with higher values, the model either begins rambling or has trouble producing coherent sentences.