<a href="https://colab.research.google.com/github/SamBoeve/LLM-Workshop/blob/main/LLMs_in_Psycholinguistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **From Tool to Theory: LLMs in Psycholinguistics**



# Before we start

Here I'll illustrate the three main use cases of LLMs in psycholinguistics.

For each, I'll download a separate model and tokenizer from the Hugging Face website. Model names and additional info can be found on the website:
https://huggingface.co/


In [None]:
import pandas as pd
import numpy as np
import torch.nn as nn
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import BertTokenizer, BertModel
from transformers import AutoModel, AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Usecase 1

## **LLMs as participants ~ Prompting**


Let's ask a large language model how familiar it is with certain words.

I used this paper as inspiration for the prompt:

Conde, J., Martínez, G., Grandury, M., Arriaga-Prieto, C., Haro, J., Schroeder, S., Hintz, F., Reviriego, P., & Brysbaert, M. (2025). Updating the German psycholinguistic word toolbox with AI-generated estimates of familiarity, concreteness, valence, arousal, and age of acquisition. doi:10.13140/RG.2.2.18709.64489.

In [None]:
# Load an instruction-finetuned model (FLAN-T5) that can respond to prompts such as questions
model_checkpoint = "google/flan-t5-base"

model = T5ForConditionalGeneration.from_pretrained(model_checkpoint)

tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

In [None]:
# List of words to rate
words = ["labyrinth", "mysterious", "president", "education", "eloquent", "kitchen"]

# Loop over the words
for word in words:
    prompt = f"""
    Complete the following task as a native speaker of English.
    Familiarity is a measure of how familiar something is.
    Please indicate how familiar you think each English word is to a native English speaker on a scale from 1 (VERY UNFAMILIAR) to 7 (VERY FAMILIAR),
    with the midpoint representing moderate familiarity.
    Typically people are very unfamiliar with words like acumen or ostentatious, while they are more familiar with words like spoon or university.
    Only answer a number from 1 to 7. Please limit your answer to numbers.
    The English word is: {word}. What familiarity rating would you give this word?
    """
    print(prompt)

    # Tokenize the input
    inputs = tokenizer(prompt, return_tensors="pt")

    # Run the input through the model
    outputs = model.generate(
        **inputs,
        max_new_tokens=5,
        # Greedy decoding is used by default; temperature only takes effect when do_sample=True
        # do_sample=True,       # enable sampling
        # num_beams=5,          # explore multiple continuations
        # temperature=0.7,      # allow moderate diversity
        # top_p=0.9,
        # num_return_sequences=3
    )

    # Decode the model's output
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(result)

    # for i, output in enumerate(outputs):
    #   print(f"Sample {i+1}: {tokenizer.decode(output, skip_special_tokens=True).strip()}")


**Note:**

These results don't make a lot of sense. There are a few potential reasons:

1. We used a small, open-source model (roughly 250 million parameters). Larger models might produce better results. For reference, GPT-4o has been estimated at roughly 200 billion parameters (the exact figure has not been disclosed), and it is not open-source. Models like this can be accessed through an API call.

2. The model is trained to answer questions but estimating word familiarity is not part of the standard training set. Finetuning the model on existing human familiarity judgements could improve the results.

3. Prompt engineering: changes to the prompt could improve (or worsen) the responses.

4. Other sampling strategies could have an impact on the outputs.
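
Whatever model or prompt you end up using, the raw text output still has to be turned into a number before it can be analysed. Below is a minimal sketch of one way to do that. The helper `parse_rating` is a hypothetical name of my own (it is not part of `transformers`), and it simply assumes that the first integer between 1 and 7 in the output is the rating.

```python
import re
from typing import Optional


def parse_rating(text: str, low: int = 1, high: int = 7) -> Optional[int]:
    """Return the first integer in [low, high] found in the text, or None.

    Hypothetical helper: assumes the first in-range integer is the rating.
    """
    for match in re.finditer(r"\d+", text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


# Example model outputs (made up for illustration)
print(parse_rating("7"))              # 7
print(parse_rating("Rating: 5."))     # 5
print(parse_rating("very familiar"))  # None
```

Returning `None` for unparseable answers makes it easy to count how often the model fails to follow the response format, which is itself a useful diagnostic.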

# Usecase 2

## **Representations**

In [None]:
# For this we load a masked language model
model_checkpoint = "google-bert/bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(model_checkpoint)

model = BertModel.from_pretrained(model_checkpoint)

Let's see how the model represents the word *bank* in two different sentences **prior** to contextualizing the input.

Before working its magic, a language model represents each token with a static embedding vector taken from a learned lookup table (comparable in spirit to a word2vec vector).

Remember, for now these word vectors are not contextualized.
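
To make the idea of a static lookup concrete, here is a toy sketch in plain Python. The three-dimensional vectors and the tiny vocabulary are invented for illustration (BERT's real embeddings have 768 dimensions), and `cosine_similarity` is my own helper, not a library function.

```python
import math

# Hypothetical embedding table: one fixed vector per token, regardless of context
toy_embeddings = {
    "bank": [0.2, -0.5, 0.7],
    "river": [0.9, 0.1, -0.3],
    "money": [-0.4, 0.8, 0.1],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sent1 = ["money", "bank"]
sent2 = ["river", "bank"]

# A static lookup ignores the neighbouring words, so both sentences get the same vector
bank_in_sent1 = toy_embeddings[sent1[-1]]
bank_in_sent2 = toy_embeddings[sent2[-1]]

print(bank_in_sent1 == bank_in_sent2)
print(round(cosine_similarity(bank_in_sent1, bank_in_sent2), 6))
```

Because the lookup returns the same vector both times, the cosine similarity between the two occurrences of *bank* is 1. The same should hold for the real BERT embeddings in the next cell.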

In [None]:
# Two sentences with different uses of the word 'bank'
sent1 = "He deposited money in the bank."
sent2 = "He admired the flowers on the river bank."

# First, we tokenize both sentences
word_id1 = tokenizer.encode(sent1, add_special_tokens=False)
word_id2 = tokenizer.encode(sent2, add_special_tokens=False)
print(word_id1)
print(word_id2)

# Next, convert the token IDs to their associated word embeddings
# Note that these are non-contextualized word embeddings
static_embedding1 = model.embeddings.word_embeddings.weight[word_id1]
static_embedding2 = model.embeddings.word_embeddings.weight[word_id2]

# We get a tensor with dimension [sequence length, embedding dimension]
print(static_embedding1.shape)
print(static_embedding2.shape)

# Compute the cosine similarity
cos = nn.CosineSimilarity(dim=0, eps=1e-6)
output = cos(static_embedding1[5, :], static_embedding2[7, :]).item()
print(f"Cosine similarity: {output:.2f}")

# Show the first five dimensions of the word embedding of 'bank' in both sentences
print(static_embedding1[5, :5])
print(static_embedding2[7, :5])

Now let's see how the vector representation of the word *bank* changes **after** we run the sentence through the BERT model.

In [None]:
target = 'bank'

sentences = [
    "He deposited money in the bank.",
    "He admired the flowers on the river bank."
]

contextual_embeddings_list = []

for sent in sentences:
  # Tokenize the input
  inputs = tokenizer(sent, return_tensors="pt")

  # Run the input through the model
  outputs = model(**inputs, output_hidden_states=True)

  # Collect the model's hidden states (i.e., the contextualized embeddings)
  last_hidden_states = outputs.hidden_states[-1]

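  # Find token positions corresponding to the target word
  # The model input includes [CLS] and [SEP], so we look up positions in the input IDs themselves
  tokenized = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
  positions = [i for i, t in enumerate(tokenized) if t == target]
  print(positions)

  # Collect the embedding of the (first occurrence of the) target word as a 1-D vector
  contextual_embedding = last_hidden_states[0, positions[0], :]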

  # Store the embeddings
  contextual_embeddings_list.append(contextual_embedding)

# Compute cosine similarity between the two contextualized embeddings
sim_between_contexts = cos(contextual_embeddings_list[0], contextual_embeddings_list[1])
print(f"Cosine similarity between 'bank' in the two contexts: {sim_between_contexts.item():.2f}")

# Usecase 3

## **Word Probabilities**

In [None]:
# Load an autoregressive model
model_checkpoint = "openai-community/gpt2"

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

First, we'll just extract the top ten predictions of the model based on a given sentence frame.
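
Before running the real model, here is a toy sketch of the two steps involved: turning logits into probabilities with softmax, and then picking the highest-probability tokens. The four-word vocabulary and the logit values are made up for illustration, and `softmax` is a plain-Python stand-in for `torch.softmax`.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and logits for the next-token position
toy_vocab = ["dog", "test", "good", "the"]
toy_logits = [1.0, 3.0, 2.0, -1.0]

toy_probs = softmax(toy_logits)

# Rank tokens by probability and keep the two most likely ones (a top-k with k=2)
ranked = sorted(zip(toy_vocab, toy_probs), key=lambda pair: pair[1], reverse=True)
top_2 = ranked[:2]
print(top_2)
```

The cell below does the same thing with GPT-2's full vocabulary, using `torch.softmax` and `torch.topk`.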

In [None]:
text = ' This is a'

# Tokenize the input
input_ids = tokenizer(text, return_tensors="pt").input_ids
print(input_ids)
# tensor([[770, 318, 257]])

# Run the input through the model
outputs = model(input_ids)
print(outputs.logits.shape)
# outputs.logits is a 3D tensor of shape (1, 3, 50257): (batch, sequence_length, vocab_size)

# print(outputs.logits)
# tensor([[[ -35.2499,  -34.7863,  -38.6658,  ...,  -41.7192,  -40.9740,
#            -35.1116],
#          [-109.9207, -110.7950, -116.1922,  ..., -118.8029, -117.7631,
#           -112.9502],
#          [-110.4429, -110.4550, -113.5300,  ..., -119.2905, -115.9005,
#           -111.2370]]], grad_fn=<UnsafeViewBackward0>)

# Apply a softmax normalization to the logits to get probabilities
# The function below applies softmax independently for every [batch_index, sequence_index] pair
# across the vocabulary axis (the vocab_size dimension).
probs = torch.softmax(outputs.logits, dim=-1).detach()

# print(probs)
# tensor([[[7.3724e-04, 1.1720e-03, 2.4216e-05,  ..., 1.1429e-06,
#           2.4080e-06, 8.4655e-04],
#          [4.7575e-05, 1.9847e-05, 8.9890e-08,  ..., 6.6052e-09,
#           1.8684e-08, 2.2999e-06],
#          [6.7679e-06, 6.6863e-06, 3.0885e-07,  ..., 9.7272e-10,
#           2.8858e-08, 3.0589e-06]]])

# We are only interested in the prediction of what comes after the final word
last_position_probs = probs[:, -1, :]

# Across the entire vocab, select the indices with the highest probability
top_10_values, top_10_indices = torch.topk(last_position_probs, k=10, dim=-1)

# Decode the tokens associated with those indices and print the result
decoded_top_10 = [tokenizer.decode([idx]) for idx in top_10_indices[0]]
print(decoded_top_10)

A more interesting application is calculating the probability of each token as it occurs in the text.

We will transform the probabilities into surprisal values (the negative log probability of a token given its preceding context), as is standard practice in psycholinguistic studies.
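
One detail to keep in mind is the unit. `torch.log_softmax` uses the natural logarithm, so the surprisal values computed in this notebook are in nats, whereas many papers report surprisal in bits (base-2 logarithm). The sketch below shows the relationship; `surprisal_bits` and `nats_to_bits` are helper names I made up for illustration.

```python
import math


def surprisal_bits(p):
    """Surprisal of an event with probability p, in bits."""
    return -math.log2(p)


def nats_to_bits(surprisal_nats):
    """Convert a surprisal value from nats to bits."""
    return surprisal_nats / math.log(2)


print(surprisal_bits(0.5))    # 1.0
print(surprisal_bits(0.125))  # 3.0

# The same value computed in nats and then converted
print(nats_to_bits(-math.log(0.125)))
```

A token with probability 0.5 carries 1 bit of surprisal, and every halving of the probability adds one more bit.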

In [None]:
def to_tokens_and_logprobs(model, tokenizer, input_texts):

    # Tokenize the input
    input_ids = tokenizer(input_texts, return_tensors="pt").input_ids

    # Run the input through the model
    outputs = model(input_ids)

    # Apply log-softmax to get log probabilities and negate them to get surprisal values (in nats)
    surprisals = -torch.log_softmax(outputs.logits, dim=-1).detach()

    # Align surprisal values with their actual token positions
    surprisals = surprisals[:, :-1, :]   # drop the final position (it predicts a token beyond the text)
    input_ids = input_ids[:, 1:]         # shift input tokens by one: position i predicts token i+1

    # Gather the surprisal of each token that actually occurs in the text
    token_surprisals = torch.gather(surprisals, 2, input_ids[:, :, None]).squeeze(-1)

    # Decode tokens and pair them with their surprisal values
    all_surprisals = []
    for sentence_ids, sentence_surprisals in zip(input_ids, token_surprisals):
        pairs = []
        for token, s in zip(sentence_ids, sentence_surprisals):
            if token.item() not in tokenizer.all_special_ids:  # skip special tokens such as <|endoftext|>
                pairs.append((tokenizer.decode(token), s.item()))
        all_surprisals.append(pairs)

    return all_surprisals

In [None]:
input_text = ["<|endoftext|> This is an example on how to calculate each token's surprisal"]
probabilities = to_tokens_and_logprobs(model, tokenizer, input_text)

print(probabilities)
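
The output above is at the token level, while psycholinguistic measures (e.g., reading times) are usually at the word level. A common approach is to sum the surprisal of the subword tokens that make up a word. Here is a minimal sketch of that idea. It assumes GPT-2 style tokens, where a leading space marks the start of a new word; `words_from_tokens` is a hypothetical helper, and the surprisal values in the example are invented.

```python
def words_from_tokens(token_surprisals):
    """Sum subword surprisals into word surprisals.

    Assumes a token that starts with a space begins a new word (GPT-2 convention).
    Tokens without a leading space (including punctuation) attach to the previous word.
    """
    words = []
    for token, s in token_surprisals:
        if token.startswith(" ") or not words:
            words.append([token.strip(), s])
        else:
            words[-1][0] += token
            words[-1][1] += s
    return [(w, s) for w, s in words]


# Made-up token-level output in the same (token, surprisal) format as above
example = [(" This", 5.0), (" is", 1.5), (" surpr", 9.0), ("isal", 2.0)]
print(words_from_tokens(example))
# [('This', 5.0), ('is', 1.5), ('surprisal', 11.0)]
```

Summing is justified by the chain rule: the log probability of a word is the sum of the log probabilities of its tokens, each conditioned on everything before it.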

# More information, code and courses

[Models](https://huggingface.co/models)

[Hugging Face Courses](https://huggingface.co/learn/llm-course/chapter1/1)

[A systematic evaluation of Dutch large language models’ surprisal estimates](https://link.springer.com/article/10.3758/s13428-025-02774-4)

[Bogaertslab](https://www.bogaertslab.com/)


