In this assignment you will be asked to extend the work by Gatti et al by checking whether form-meaning mappings learned on a different yet related language to that considered in the original study still capture the perceived valence of pseudowords. To do this you will be asked to engage with several different resources and adapt the pipeline following the instructions. Along the way, you will be asked to answer a few questions.

You need to submit the complete notebook in .ipynb format, with intermediate outputs visible. The notebook should be named as follows:

CL2025_groupN_assignment.ipynb

where N is the group number. Submissions in the wrong format or with names not adhering to the guidelines will not be evaluated.

Indicate group members' names, student numbers, and contributions below:
- 1. Lieke van Eijk
- 2. Aimélie Speet (2103752);
- 3. Fleur Sülter
- 4. Julian Van Dijk
- 5. Lars Heijnen

In [None]:
# the code has been tested using the psycho-embeddings library to extract representations from LLMs. You can also use other libraries,
# as long as you make sure that you are producing the correct output.
!git clone https://github.com/MilaNLProc/psycho-embeddings.git
%cd psycho-embeddings
!pip install datasets
!pip install fasttext
!pip install torch

In [None]:
# the solution to the assignment has been obtained using these packages.
# you're free to use other packages though: consider this as an indication, not a prescription.
import nltk
import numpy as np
import pandas as pd
import fasttext as ft
import pickle as pkl
import fasttext.util
from tqdm import tqdm
from collections import defaultdict
from transformers import AutoTokenizer
from psycho_embeddings import ContextualizedEmbedder
from IPython.display import display
from collections import Counter
import pandas as pd
import io
import requests
import pyreadr

**Task 1** (*10 points available, see breakdown per task below*)

You should replicate the main design in the paper *Valence without meaning* by Gatti and colleagues (2024), using estimates collected for Dutch word valence to train linear regression models and apply them to predict the valence of English pseudowords from Gatti and colleagues.

In detail, to train your regression models, you should use the dataset by Speed and Brysbaert (2024) containing crowd-sourced valence ratings (use the metadata to identify the relevant columns) collected for approximately 24,000 Dutch words. See the paper *Ratings of valence, arousal, happiness, anger, fear, sadness, disgust, and surprise for 24,000 Dutch words* by Speed and Brysbaert (2024).

You should train a letter unigram model and a bigram model. Each model should be trained on Dutch words only.

Pay attention to one issue though: pseudowords created for English may be valid words in Dutch: therefore, you should first filter the list of pseudowords against a large store of Dutch words. To do so, use the words in the Dutch prevalence lexicon available in this OSF repository: https://osf.io/9zymw/. Essentially, you need to exclude any pseudoword that happens to be a word for which a prevalence estimate is available, whatever the prevalence is.

Each code block indicates how many points are available and how they are attributed.

In [3]:
# read in the pseudowords from Gatti and colleagues, as well as the valence ratings for 24,000 Dutch words from Speed and Brysbaert (2024)
# show the first 5 lines of each dataset.
# 1 point for identifying the correct files and correctly loading their content

#Gatti Data
gatti_data =pyreadr.read_r('/Users/larsheijnen/CL/data/data_pseudovalence.RData')

for key, value in gatti_data.items():
    print(f"Variable name: {key}")
    display(value.head())

#Speed and Brysbaert (2024)
speed_and_rysbaert_all = pd.read_excel("/Users/larsheijnen/CL/data/All_Valence.xlsx", sheet_name = "All")
speed_and_rysbaert_means = pd.read_excel("/Users/larsheijnen/CL/data/All_Valence.xlsx", sheet_name = "Means")
display(speed_and_rysbaert_all.head())
display(speed_and_rysbaert_means.head())

Variable name: data_fin


Unnamed: 0_level_0,Valence,predicted_val,predicted_valL,predicted_valL_BI,predicted_valDIM,predicted_valL_DIM,predicted_valBI,predicted_valBI_DIM
rownames,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
aardvark,6.26,6.392012,4.92018,6.410768,5.772722,5.774341,6.410768,6.392012
abalone,5.3,4.756492,5.284912,5.115389,4.728264,4.85812,5.115389,4.756492
abandon,2.84,4.260055,5.001226,5.47986,3.978241,3.987623,5.47986,4.260055
abandonment,2.63,4.196807,5.022504,5.334364,3.83333,3.828077,5.334364,4.196807
abbey,5.85,6.123953,5.147159,5.162931,6.064834,6.094675,5.162931,6.123953


Variable name: data_2


Unnamed: 0,X,pseudoword,Value,predicted_valence,predictedL_valence,predictedL_Bi_valence,predicted_Dim_valence,predictedL_Dim_valence,predictedBi_Dim_valence,predictedBi_valence,LDist,Ortho_VAL,Semant_Neigh,SDist,Semant_VAL
0,1,abhert,0.452501,7.414814,5.116167,6.444633,6.783771,6.630497,7.414814,6.444633,2,4.655714,ordinary,0.558492,5.05
1,2,abhict,0.434171,8.233714,5.059183,6.509936,7.366068,7.377534,8.233714,6.509936,2,3.093333,cardigan,0.622202,5.95
2,3,acleat,0.527803,5.552468,5.262971,5.245826,5.268643,5.396114,5.552468,5.245826,1,4.24,solarium,0.57515,6.1
3,4,acmure,0.604889,8.71464,5.120029,6.562896,7.680827,7.58323,7.80991,5.414532,2,5.885,bad,0.570299,3.24
4,5,acoed,0.53899,7.340002,5.115652,5.309727,7.105662,7.024771,7.340002,5.309727,1,5.68,girl,0.499035,7.15


Variable name: data_3


Unnamed: 0,pseudoword,VAL2,Elo,RW,Best,Worst,Unchosen,BestWorst,ABW,David,...,predictedL_Bi_valence,predicted_Dim_valence,predictedL_Dim_valence,predictedBi_Dim_valence,predictedBi_valence,LDist,Ortho_VAL,Semant_Neigh,SDist,Semant_VAL
0,acleat,0.511226,-14.04963,0.516195,3,2,25,8.833333,0.066691,9400,...,5.245826,5.268643,5.396114,5.552468,5.245826,1,4.24,solarium,0.57515,6.1
1,acmure,0.539304,117.178153,0.543576,9,6,15,25.5,0.200671,62800,...,6.562896,7.680827,7.58323,7.80991,5.414532,2,5.885,bad,0.570299,3.24
2,acraw,0.468173,-63.085202,0.462979,3,5,22,-16.166667,-0.133531,-33300,...,4.897807,8.202013,8.220112,7.968543,4.897807,2,5.044706,side,0.57206,5.32
3,adlor,0.601365,119.915944,0.603029,7,1,22,50.5,0.405465,139500,...,4.625245,5.855695,5.899081,5.69689,4.625245,2,5.640667,act,0.589797,5.64
4,adpite,0.573363,106.951853,0.574467,7,2,21,42.166667,0.336472,103500,...,5.100143,8.216379,8.225612,8.363778,5.100143,2,6.066667,regard,0.590583,5.7


Variable name: .Random.seed


Unnamed: 0,.Random.seed
0,10403
1,593
2,1050179519
3,2033100213
4,-1373968898


Variable name: Count


Unnamed: 0,Word,a,b,c,d,e,f,g,h,i,...,Dim_292,Dim_293,Dim_294,Dim_295,Dim_296,Dim_297,Dim_298,Dim_299,Dim_300,Valence
0,aardvark,3,0,0,1,0,0,0,0,0,...,0.04983,-0.05288,0.019918,-0.003339,-0.005436,0.039293,-0.010782,-0.02301,0.007921,6.26
1,abalone,2,1,0,0,1,0,0,0,0,...,0.01909,-0.083532,0.024157,-0.006709,-0.005889,0.019107,0.054735,0.026275,0.026177,5.3
2,abandon,2,1,0,1,0,0,0,0,0,...,-0.01842,0.003779,0.011741,-0.012012,0.007799,-0.062272,-0.006584,-0.008598,-0.012287,2.84
3,abandonment,2,1,0,1,1,0,0,0,0,...,-0.020326,-0.040106,0.000867,-0.022475,-0.013669,0.021974,0.021332,0.021166,0.003248,2.63
4,abbey,1,2,0,0,1,0,0,0,0,...,0.077062,-0.073641,-0.014475,0.034482,-0.01115,0.028477,0.034331,0.018858,-0.047663,5.85


Variable name: comb_2


Unnamed: 0,Word,Value1,Value2
0,abhert,0.473009,0.406491
1,abhict,0.375453,0.472723
2,acleat,0.58384,0.496628
3,acmure,0.607354,0.597101
4,acoed,0.526847,0.551518


Variable name: comb_3


Unnamed: 0,Word,Value1,Value2
0,acleat,0.493853,0.533178
1,acmure,0.578694,0.520666
2,acraw,0.506507,0.430297
3,adlor,0.598781,0.599522
4,adpite,0.598732,0.536912


Unnamed: 0,List,Participant,Word,Valence,Unknown,RemoveParticipant
0,Lijst 5,Lijst 5_PP1,aai,5.0,0,0
1,Lijst 5,Lijst 5_PP11,aai,3.0,0,0
2,Lijst 5,Lijst 5_PP12,aai,3.0,0,0
3,Lijst 5,Lijst 5_PP2,aai,3.0,0,0
4,Lijst 5,Lijst 5_PP3,aai,4.0,0,0


Unnamed: 0,Word,Valence,N_Unknown,N_Valence,ProportionUnknown,RemoveUnknown
0,concordantie,3.222222,11,9,0.55,1
1,nepotisme,2.111111,10,9,0.526316,1
2,prefectuur,3.222222,10,9,0.526316,1
3,prevaleren,3.444444,10,9,0.526316,1
4,affiliatie,3.1,9,10,0.473684,0


In [4]:
# filter out pseudowords that happen to be valid Dutch words (mind case folding!)
# show the set of pseudowords filtered out.
# 1 point for applying the correct filtering

# Read the valid Dutch words file with tab separator
valid_dutch_words = pd.read_csv('/Users/larsheijnen/CL/data/prevalence_netherlands.csv', sep='\t')


# Ensure all words are lowercase for case-insensitive comparison
valid_dutch_words_set = set(valid_dutch_words['word'].str.lower())
# print(list(valid_dutch_words_set)[:5]) #['middenklasse', 'vraat', 'opduvel', 'pleiter', 'bosgeest']


# Get pseudowords from Gatti et al (using 'comb_3' as example, adjust if needed)
pseudowords = gatti_data['Count']['Word'].str.lower()

# Filter out pseudowords that are valid Dutch words
filtered_out = set(pseudowords).intersection(valid_dutch_words_set)

# Show the set of pseudowords filtered out
print(filtered_out)



{'petticoat', 'drainage', 'ponder', 'delirium', 'indifferent', 'ammonium', 'microfilm', 'lotto', 'bureau', 'skinny', 'haven', 'rotten', 'fungus', 'ska', 'accent', 'bottleneck', 'pension', 'voyeur', 'pulsar', 'boogie', 'biceps', 'prudent', 'veranda', 'angora', 'toaster', 'hertz', 'live', 'nipper', 'pin', 'project', 'turbine', 'uniform', 'decor', 'sheet', 'impromptu', 'drama', 'fox', 'caravan', 'arrangement', 'deficit', 'grip', 'wringer', 'mono', 'wireless', 'pot', 'albino', 'basis', 'buddy', 'trapeze', 'lag', 'teen', 'harp', 'charge', 'feeder', 'hop', 'gorilla', 'module', 'sport', 'hostess', 'counseling', 'appendicitis', 'snowboard', 'intolerant', 'trivia', 'rum', 'petroleum', 'ban', 'waterproof', 'coherent', 'credit', 'loom', 'bowler', 'pantry', 'citrus', 'spleen', 'apex', 'paperback', 'collie', 'operator', 'memorabilia', 'masochist', 'flux', 'salmonella', 'irrelevant', 'sassafras', 'nitwit', 'omelet', 'baron', 'giraffe', 'pastor', 'camouflage', 'end', 'bouquet', 'ride', 'penny', 'mist

In [5]:
# encode Dutch words and pseudowords from Gatti et al as uni- and bi-gram vectors
# show the uni-gram and bi-gram encoding of the pseudoword ampgrair
# 2 points for correctly encoding the target strings as uni- and bi-gram vectors


def unigram_vector(word, vocab=None):
    counts = Counter(word)
    if vocab is not None:
        return [counts.get(char, 0) for char in vocab]
    else:
        return counts

def bigram_vector(word, vocab=None):
    bigrams = [word[i:i+2] for i in range(len(word)-1)]
    counts = Counter(bigrams)
    if vocab is not None:
        return [counts.get(bigram, 0) for bigram in vocab]
    else:
        return counts

# build vocabularies from Dutch words
dutch_words = speed_and_rysbaert_means['Word'].str.lower().tolist()
all_unigrams = sorted(set(''.join(dutch_words)))
all_bigrams = sorted(set(b for w in dutch_words for b in [w[i:i+2] for i in range(len(w)-1)]))

# get pseudowords (excluding filtered out ones)
pseudoword_list = [w for w in gatti_data['comb_3']['Word'].str.lower() if w not in filtered_out]

# example: encode 'ampgrair'
test_word = 'ampgrair'
uni_vec = unigram_vector(test_word, vocab=all_unigrams)
bi_vec = bigram_vector(test_word, vocab=all_bigrams)

print(f"Unigram encoding for {test_word}':", uni_vec)
print(f"Bigram encoding for '{test_word}':", bi_vec)


Unigram encoding for ampgrair': [2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Bigram encoding for 'ampgrair': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [6]:
# use word valence estimates from Speed and Brysbaert (2024) to train
# - a uni-gram model
# - a bi-gram model
# 2 points for correctly trained models

from sklearn.linear_model import LinearRegression


# prepare X and y for Dutch words
X_uni = [unigram_vector(w, vocab=all_unigrams) for w in dutch_words]
X_bi = [bigram_vector(w, vocab=all_bigrams) for w in dutch_words]
y = speed_and_rysbaert_means['Valence'].values

# train models
uni_model = LinearRegression().fit(X_uni, y)
bi_model = LinearRegression().fit(X_bi, y)

In [7]:
# apply trained models to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same models back onto the training set to see how well they predict the valence of words in Speed and Brysbaert (2024).
# 2 points for correctly applied models

# prepare unigram and bigram vectors for pseudowords
X_uni_pseudo = [unigram_vector(w, vocab=all_unigrams) for w in pseudoword_list]
X_bi_pseudo = [bigram_vector(w, vocab=all_bigrams) for w in pseudoword_list]

# predict valence for pseudowords
pseudo_pred_uni = uni_model.predict(X_uni_pseudo)
pseudo_pred_bi = bi_model.predict(X_bi_pseudo)

# predict valence for Dutch words (training set)
train_pred_uni = uni_model.predict(X_uni)
train_pred_bi = bi_model.predict(X_bi)

In [8]:
# compute the Spearman correlation coefficients between true valence and predicted valence under both uni- and bi-gram models for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show both correlation coefficients.
# 2 points for the correct Spearman correlation coefficients (rounded to the third decimal place)

from scipy.stats import spearmanr

# true valence for Dutch words
true_valence_words = y

# true valence for pseudowords (use Value1 from gatti_data['comb_3'])
true_valence_pseudowords = gatti_data['comb_3']['Value1'].values

# uni-gram model
corr_uni_words, _ = spearmanr(true_valence_words, train_pred_uni)
corr_uni_pseudo, _ = spearmanr(true_valence_pseudowords, pseudo_pred_uni)

# bi-gram model
corr_bi_words, _ = spearmanr(true_valence_words, train_pred_bi)
corr_bi_pseudo, _ = spearmanr(true_valence_pseudowords, pseudo_pred_bi)

print(f"Unigram model: Spearman r (words) = {corr_uni_words:.3f}, Spearman r (pseudowords) = {corr_uni_pseudo:.3f}")
print(f"Bigram model:  Spearman r (words) = {corr_bi_words:.3f}, Spearman r (pseudowords) = {corr_bi_pseudo:.3f}")

Unigram model: Spearman r (words) = 0.089, Spearman r (pseudowords) = 0.173
Bigram model:  Spearman r (words) = 0.321, Spearman r (pseudowords) = 0.027


**Task 2** (*8 points available, see breakdown below*)

Again following Gatti and colleagues, you should encode the target strings (pseudowords and Dutch words from Speed and Brysbaert) as fastText embeddings, train a multiple regression model on Dutch words and apply it to the pseudowords in Gatti et al. You should finally report the Spearman correlation coefficient between observed and predicted valence for both words and pseudowords.

You should use the pre-trained fastText model for Dutch, available at this page: https://fasttext.cc/docs/en/crawl-vectors.html

Finally, you should answer two questions about the fastText model (see below).

In [13]:
# load the fastText model
# 1 point for correctly loading the appropriate fastText model

import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')

Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

KeyboardInterrupt: 

What is the dimensionality of the pre-trained Dutch fastText embeddings? (*1 point for the correct answer*)

In [None]:
# encode Dutch words and pseudowords as fastText embeddings
# show the first 20 values of the embedding of the word 'speelplaats' and of the pseudoword 'danchunk'
# 2 points for correctly encoding words and pseudowords with fastText

# Get embedding for 'speelplaats' (Dutch word)
speelplaats_vec = ft.get_word_vector('speelplaats')
print("First 20 values for 'speelplaats':", speelplaats_vec[:20])

# Get embedding for 'danchunk' (pseudoword)
danchunk_vec = ft.get_word_vector('danchunk')
print("First 20 values for 'danchunk':", danchunk_vec[:20])

In [None]:
# train regression model on word valence
# 1 point for correctly training the regression model

In [None]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

In [None]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient.
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)

**Task 3** (*6 points available, see breakdown below*)

Now you are asked to extend the work by Gatti et al by also considering the representations learned by a transformer-based models, in detail *RobBERT v2* (https://huggingface.co/pdelobelle/robbert-v2-dutch-base). You should follow the same pipeline as for the previous models, encoding both Dutch words from Speed and Brysbaert (2024) and the pseudowords from Gatti et al using the embedding of each string at layer 0, before positional information is factored in. If a string consists of multiple tokens, average the embeddings of all tokens to produce the embedding of the whole string. Then train a multiple regression model on the valence of Dutch words, apply it to the pseudowords, and compute the Spearman correlation between observed and predicted ratings.

Use the HuggingFace model card for RobBERT v2 to check how to access it.

I recommend saving the embeddings to file once you have generated them and you know they are correct: embedding thousands of strings takes some time, and you don't want to have to do it again. For the same reason, develop your code by considering only a small fractions of the words and pseudowords, in order to quickly see if something is wrong. Only when you are positive it works, embed all strings.

In [None]:
# load and instantiate the right model
# 1 point for loading the right model

In [None]:
# encode the words and pseudowords using RobBERT v2. I've used the free GPU runtime on COLAB to speed things up,
# but in this case you need to batch the words and pseudowords. You can use the function below to create batches
# but you will have to pay attention at how you store embeddings.
# show the first 20 values of the embedding of the word 'miauwen' and of the pseudoword 'lixthless'
# 2 points for correctly encoding words and pseudowords

def chunks(lst, n):

    """Chunks a list into equal chunks containing n elements. Returns a list of lists."""

    chunked = []
    for i in range(0, len(lst), n):
        chunked.append(lst[i:i + n])
    return chunked


In [None]:
# train regression model on word valence estimates from Speed and Brysbaert (2024)
# 1 point for correctly training the regression model

In [None]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).
# 1 point for correctly applied model

In [None]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient
# 1 point for the correct Spearman correlation coefficients (rounded to the third decimal place)

**Task 4** (*16 points available, 4 for each question*)

Answer the following questions.

**4a.** Describe the performance of each featurization, comparing
- the performance of a same model between the training and test set
- the performance of different models on the training set
- the performance of different models on the test set

(*4 points available, max 150 words*)

*type your answer here*

**4b.** Compare the correlations you found when training uni-gram, bi-gram, and fastText models on Dutch words and the correlations of similar models trained on English data as reported by Gatti and colleagues; summarize the most important similarities and differences.

(*4 points available, max 150 words*)

*type your answer here*

**4c.** Do you think the performance of the fastText featurization would change if you were to use different n-grams? Would you make them smaller or larger? Justify your answer.

(*4 points available, max 150 words*)

*type your answer here*

**4d.** Do you think that training the same models on uni-grams, bi-grams, fastText and transformer-based embeddings but using valence ratings for Finnish (a language which uses the same alphabet as English but is not a IndoEuropean language) words would yield a similar pattern of results? Justify your answer.

(*4 points available, max 150 words*)

*type your answer here*

**Task 5** (*3 points available*)

Compute the average Levenshtein Distance (aLD) between each pseudoword and the 20 words at the smallest edit distance from it. Consider the set of words you used to filter out pseudowords that happen to be valid Dutch words (the file is available in this OSF repository: https://osf.io/9zymw/) to retrieve the 20 words at the smallest edit distance.

In [None]:
# compute the average Levenshtein distance from each pseudoword to the words used to filter out pseudowords.
# Show the aLD estimate for the pseudowords 'nedukes', 'pewbin', and 'vibcines'
# 3 points for correctly computing aLD for pseudowords

**Task 6** (*3 points available*)

For each pseudoword, record the number of tokens in which RobBERT v2 encodes it.

In [None]:
# record the number of tokens in which RobBERT divides each pseudoword
# show the number of tokens for the pseudowords 'yuxwas', 'skibfy', and 'errords'
# 3 points for correctly mapping pseudowords to number of tokens

**Task 7** (*5 points available, see breakdown below*)

Compute the residuals of the predicted valence under the four regressors trained and applied in tasks 2 to 4. Then, correlate the residuals from all four models with aLD. Finally, correlate the residuals from the RobBERT v2 model with the number of tokens in which each pseudoword is split. Use the Pearson's correlation coefficient.

In [None]:
# compute the residuals from all four regression models fitted before
# 1 point available for correctly computing residuals

In [None]:
# compute the Pearson's correlation between residuals and average LD for all models,
# as well as the correlation between RobBERT v2 residuals and the number of tokens in which each pseudoword
#    is encoded by the RobBERT v2 model.
# show all correlation coefficients
# 4 points for the correct correlation coefficients

**Task 8** What is the relation between the errors each model made and aLD? what about the number of tokens (limited to the RobBERT v2 model)?

(*4 points available, max 150 words*)

*testo in corsivo*