# Predicting Pseudoword Valence Using Cross-Linguistic Regression Models

## Replicating and Extending *Valence Without Meaning* (Gatti et al., 2024)

Imad-eddine El Bakiouli  
BSc Cognitive Science & Artificial Intelligence  
Tilburg University | 2025

This project extends the work of Gatti et al. by examining whether form–meaning mappings learned from a related language can capture the perceived valence of pseudowords. To investigate this, multiple linguistic resources were integrated and the original methodological pipeline was adapted accordingly.

In [82]:
!git clone https://github.com/MilaNLProc/psycho-embeddings.git
%cd psycho-embeddings
!pip install datasets

Cloning into 'psycho-embeddings'...
remote: Enumerating objects: 199, done.
remote: Counting objects: 100% (199/199), done.
remote: Compressing objects: 100% (138/138), done.
remote: Total 199 (delta 105), reused 141 (delta 53), pack-reused 0 (from 0)
Receiving objects: 100% (199/199), 67.91 KiB | 556.00 KiB/s, done.
Resolving deltas: 100% (105/105), done.
/content/psycho-embeddings/psycho-embeddings/psycho-embeddings/psycho-embeddings


In [83]:
# the solution to the assignment has been obtained using these packages.
import nltk
import torch
import Levenshtein
import numpy as np
import pandas as pd
import pickle as pkl
import fasttext.util
import fasttext as ft
from tqdm import tqdm
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from collections import defaultdict
from transformers import AutoTokenizer
from sklearn.linear_model import LinearRegression
from psycho_embeddings import ContextualizedEmbedder
from sklearn.feature_extraction.text import CountVectorizer
from transformers import RobertaTokenizer, RobertaForSequenceClassification


This project replicates the design proposed in *Valence without meaning* (Gatti et al., 2024). Linear regression models were trained using crowd-sourced Dutch valence ratings (Speed & Brysbaert, 2024) and applied to predict the valence of English pseudowords.

To train the regression models, the dataset by Speed and Brysbaert (2024) was used, which contains crowd-sourced valence ratings for approximately 24,000 Dutch words. The metadata was used to identify the relevant columns for analysis. The dataset is described in the paper *Ratings of valence, arousal, happiness, anger, fear, sadness, disgust, and surprise for 24,000 Dutch words* (Speed & Brysbaert, 2024).

Two character-level models were implemented: a letter unigram model and a letter bigram model. Both models were trained exclusively on Dutch words.

One important issue addressed in this project is that pseudowords created for English may correspond to valid Dutch words. To prevent this, the pseudoword list was filtered against a large store of Dutch words using the Dutch prevalence lexicon (OSF repository: https://osf.io/9zymw/). Any pseudoword for which a Dutch prevalence estimate was available was excluded, regardless of the prevalence value.

In [98]:
# Load pseudowords (Gatti et al.) and Dutch valence norms (Speed & Brysbaert, 2024) from Google Drive
# Extract pseudoword column and inspect both datasets

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

gattifile = pd.read_csv("/content/drive/MyDrive/Gatti_pseudowords_1500.csv")
print("First 5 lines of the dataset from Gatti:")
print(gattifile.head())
pseudowords = gattifile["pseudoword"]

xlsx_path = '/content/drive/MyDrive/SpeedBrysbaertEmotionNorms.xlsx'
valence = pd.read_excel(xlsx_path)

print("First 5 lines of the data set from Speed and Brysbaert:")
print(valence.head())




Mounted at /content/drive
First 5 lines of the dataset from Gatti:
   X pseudoword     Value  predicted_valence  predictedL_valence  \
0  1     abhert  0.452501           7.414814            5.116167   
1  2     abhict  0.434171           8.233714            5.059183   
2  3     acleat  0.527803           5.552468            5.262971   
3  4     acmure  0.604889           8.714640            5.120029   
4  5      acoed  0.538990           7.340002            5.115652   

   predictedL_Bi_valence  predicted_Dim_valence  predictedL_Dim_valence  \
0               6.444633               6.783771                6.630497   
1               6.509936               7.366068                7.377534   
2               5.245826               5.268643                5.396114   
3               6.562896               7.680827                7.583230   
4               5.309727               7.105662                7.024771   

   predictedBi_Dim_valence  predictedBi_valence  LDist  Ortho_VAL  \
0   

In [99]:
# Filter pseudowords against the Dutch prevalence lexicon to remove items
# that are valid Dutch words; retain only true pseudowords for analysis

print("Original pseudoword count:", len(gattifile))
prev = pd.read_csv(
    '/content/drive/MyDrive/prevalence_netherlands.csv',
    sep='\t'
)

valid_words = set(prev['word'].astype(str).str.lower())

pws = gattifile['pseudoword'].astype(str)
pws_low = pws.str.lower()

mask_real = pws_low.isin(valid_words)
removed = set(pws[mask_real])
print("Pseudowords removed:", removed)

filtered_gatti = gattifile.loc[~mask_real].reset_index(drop=True)
print("Remaining pseudoword count:", len(filtered_gatti))

pseudowords_list = filtered_gatti['pseudoword'].str.lower().tolist()



Original pseudoword count: 1500
Pseudowords removed: {'pimpen'}
Remaining pseudoword count: 1499


In [100]:
# Generate character unigram (1-gram) and bigram (2-gram) feature matrices
# for Dutch training words and pseudowords

dutch_words = valence["Word"].str.lower().tolist()

uni_vec = CountVectorizer(analyzer="char", ngram_range=(1, 1))
bi_vec  = CountVectorizer(analyzer="char", ngram_range=(2, 2))

X_dutch_uni = uni_vec.fit_transform(dutch_words)
X_dutch_bi  = bi_vec.fit_transform(dutch_words)
X_pseudo_uni = uni_vec.transform(pseudowords_list)
X_pseudo_bi  = bi_vec.transform(pseudowords_list)

ex = ["ampgrair"]
uni_ex = uni_vec.transform(ex).toarray()[0]
bi_ex  = bi_vec.transform(ex).toarray()[0]

print("\nUnigram vector for 'ampgrair':")
print(uni_ex)
print("\nBigram vector for 'ampgrair':")
print(bi_ex)



Unigram vector for 'ampgrair':
[2 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Bigram vector for 'ampgrair':
[0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
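
The counts behind these vectors can be reproduced by hand. The following is a minimal sketch in plain Python (the helper `char_ngrams` is illustrative and not part of scikit-learn) that lists the character unigrams and bigrams of 'ampgrair'. It gives two letters that occur twice and seven distinct bigrams, which is consistent with the two 2s in the unigram vector and the seven 1s visible in the bigram vector above.

```python
from collections import Counter

def char_ngrams(word, n):
    """Return all contiguous character n-grams of `word`, in order."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

unigram_counts = Counter(char_ngrams("ampgrair", 1))
bigram_counts = Counter(char_ngrams("ampgrair", 2))

print(sorted(unigram_counts.items()))
# [('a', 2), ('g', 1), ('i', 1), ('m', 1), ('p', 1), ('r', 2)]
print(sorted(bigram_counts))
# ['ai', 'am', 'gr', 'ir', 'mp', 'pg', 'ra']
```

Unlike this sketch, `CountVectorizer` fixes the column order from the vocabulary seen during `fit`, so n-grams that never occur in the Dutch training words are silently dropped when pseudowords are transformed.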

In [102]:
# Train linear regression models on unigram and bigram features

y = valence["Valence"].values

uni_model = LinearRegression().fit(X_dutch_uni, y)

bi_model = LinearRegression().fit(X_dutch_bi, y)

uni_r2 = uni_model.score(X_dutch_uni, y)
bi_r2 = bi_model.score(X_dutch_bi, y)

print(f"Unigram model R2 score: {uni_r2:.4f}")
print(f"Bigram model R2 score: {bi_r2:.4f}")

Unigram model R2 score: 0.0095
Bigram model R2 score: 0.1160


In [103]:
# Generate valence predictions for pseudowords and Dutch words

y_pseudo_pred_uni = uni_model.predict(X_pseudo_uni)
y_pseudo_pred_bi  = bi_model.predict(X_pseudo_bi)

y_train_pred_uni = uni_model.predict(X_dutch_uni)
y_train_pred_bi  = bi_model.predict(X_dutch_bi)

print("Predicted valence for pseudowords (unigram):", y_pseudo_pred_uni)
print("Predicted valence for pseudowords (bigram) :", y_pseudo_pred_bi)
print("\nPredicted valence for Dutch words (unigram):", y_train_pred_uni)
print("Predicted valence for Dutch words (bigram) :", y_train_pred_bi)

Predicted valence for pseudowords (unigram): [2.91964934 2.93271094 3.00193988 ... 2.83886063 2.84023961 2.79305788]
Predicted valence for pseudowords (bigram) : [3.23914867 3.04290828 3.17845976 ... 2.86050361 3.00660316 3.05633199]

Predicted valence for Dutch words (unigram): [2.97356517 3.04873071 2.99644161 ... 2.96466515 2.95004878 2.90939766]
Predicted valence for Dutch words (bigram) : [3.07687701 2.86048787 3.14367623 ... 2.79148374 2.6557058  2.96608726]


In [121]:
# Compute Spearman correlations between predicted and observed valence

r_uni_train, _ = spearmanr(y, y_train_pred_uni)
r_bi_train,  _ = spearmanr(y, y_train_pred_bi)

y_pseudo_true = filtered_gatti["Value"].values

y_pseudo_pred_uni = y_pseudo_pred_uni[:len(y_pseudo_true)]
y_pseudo_pred_bi  = y_pseudo_pred_bi[:len(y_pseudo_true)]

r_uni_pseudo, _ = spearmanr(y_pseudo_true, y_pseudo_pred_uni)
r_bi_pseudo,  _ = spearmanr(y_pseudo_true, y_pseudo_pred_bi)

print(f"Spearman correlation (unigram) of Dutch words:    ρ = {r_uni_train:.3f}")
print(f"Spearman correlation (bigram)  of Dutch words:    ρ = {r_bi_train:.3f}")
print(f"Spearman correlation (unigram) of pseudowords:    ρ = {r_uni_pseudo:.3f}")
print(f"Spearman correlation (bigram)  of pseudowords:    ρ = {r_bi_pseudo:.3f}")

Spearman correlation (unigram) of Dutch words:    ρ = 0.090
Spearman correlation (bigram)  of Dutch words:    ρ = 0.321
Spearman correlation (unigram) of pseudowords:    ρ = 0.260
Spearman correlation (bigram)  of pseudowords:    ρ = 0.101


Following the approach of Gatti et al., the target strings (pseudowords and Dutch words from Speed and Brysbaert) were encoded using fastText embeddings. A multiple regression model was trained on the Dutch words and subsequently applied to the pseudowords introduced by Gatti et al.

Model performance was evaluated by reporting the Spearman correlation coefficient between observed and predicted valence scores for both Dutch words and pseudowords.
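
For reference, Spearman's coefficient is the Pearson correlation computed on ranks, which is why it rewards getting the ordering right rather than the exact values. The sketch below is a plain-Python illustration with made-up numbers; the notebook itself relies on `scipy.stats.spearmanr`, and the helper names here are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.9, 3.0, 3.2, 3.1, 9.0]  # correct order except for one swapped pair
print(round(spearman(observed, predicted), 3))  # 0.9
```

The outlying prediction of 9.0 leaves the rank correlation untouched, which makes this measure suitable when predictions and ratings are on different scales.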

Pre-trained fastText embeddings for Dutch were used, obtained from the official fastText repository (https://fasttext.cc/docs/en/crawl-vectors.html).

In addition, the properties and implications of the fastText embedding model were analysed to better understand its role in capturing form–meaning relationships.

In [90]:
import fasttext.util
ft = fasttext.load_model("/content/drive/MyDrive/cc.nl.300.bin")


### FastText Model Properties

The pre-trained Dutch fastText embeddings have a dimensionality of 300.

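According to the fastText documentation for these pre-trained vectors, the model was trained with character n-grams of length 5. (The range of 3 to 6 is the library default when training a new unsupervised model.)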
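
Whatever range a particular model uses, the decomposition into subwords works the same way: the word is wrapped in boundary markers `<` and `>`, and every substring whose length lies between the minimum and maximum n is extracted. The sketch below illustrates this in plain Python. The function `subword_ngrams` and its parameters `minn` and `maxn` are illustrative names, the defaults of 3 and 6 are the library's unsupervised defaults, and the real implementation additionally hashes each n-gram into a fixed number of buckets.

```python
def subword_ngrams(word, minn=3, maxn=6):
    """Character n-grams of `<word>` with lengths minn..maxn (inclusive)."""
    wrapped = f"<{word}>"  # boundary markers are added before extraction
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

grams = subword_ngrams("danchunk")
print(len(grams))   # 26
print(grams[:4])    # ['<da', 'dan', 'anc', 'nch']

# A single n-gram length, e.g. 5, yields far fewer units per word.
print(len(subword_ngrams("danchunk", minn=5, maxn=5)))  # 6
```

Because an out-of-vocabulary string such as a pseudoword has no word-level vector, its embedding is built from these subword vectors alone. With the loaded model, `ft.get_subwords("danchunk")` should show the units actually used.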

In [110]:
# Encode Dutch words and pseudowords using pre-trained fastText embeddings

X_ft = np.vstack([ft.get_word_vector(w) for w in dutch_words])
X_pw_ft = np.vstack([ft.get_word_vector(w) for w in pseudowords_list])

real_word   = "speelplaats"
pseudo_word = "danchunk"

vec_real   = ft.get_word_vector(real_word)
vec_pseudo = ft.get_word_vector(pseudo_word)

print(f"First 20 values of '{real_word}':\n{vec_real[:20]}\n")
print(f"First 20 values of '{pseudo_word}':\n{vec_pseudo[:20]}")

First 20 values of 'speelplaats':
[ 0.0253247  -0.00634261  0.02746305 -0.04024595  0.04888906  0.00660965
 -0.04152017 -0.01824508 -0.00645641  0.00093806  0.0708492  -0.03291791
  0.00263817 -0.02825846 -0.02188046 -0.03188037 -0.01846142 -0.02203094
 -0.01883078 -0.00259199]

First 20 values of 'danchunk':
[-0.00592199  0.00097547  0.05925412  0.00053251 -0.00386978 -0.02089076
 -0.02829577  0.00972911 -0.02510111 -0.11454885 -0.02695064  0.01551034
  0.02384409  0.01009528  0.04545438  0.00997385 -0.00474529  0.02524533
  0.02430548 -0.02851078]


In [108]:
# Train regression model on word valence

ft_reg = LinearRegression().fit(X_ft, y)

print("FastText‐based regression R2 on Dutch valence:", ft_reg.score(X_ft, y))

FastText‐based regression R2 on Dutch valence: 0.5200443073017833


In [118]:
# Apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).

y_pw_pred_ft   = ft_reg.predict(X_pw_ft)
y_train_pred_ft = ft_reg.predict(X_ft)

print("Predicted valence for pseudowords (FastText):", y_pw_pred_ft)
print("Predicted valence for Dutch words (FastText):", y_train_pred_ft)


Predicted valence for pseudowords (FastText): [3.1155238 3.0766866 2.8852634 ... 3.0173347 2.9208107 2.9239726]
Predicted valence for Dutch words (FastText): [4.3099604 2.865531  4.036543  ... 3.1800575 2.8078961 3.1230912]


In [122]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)

r_dutch, p_dutch = spearmanr(y, y_train_pred_ft)
r_pseudo, p_pseudo = spearmanr(y_pseudo_true, y_pw_pred_ft)

print(f"Spearman correlation of Dutch words:      ρ = {r_dutch:.3f}")
print(f"Spearman correlation of pseudowords :     ρ = {r_pseudo:.3f}")


Spearman correlation of Dutch words:      ρ = 0.724
Spearman correlation of pseudowords :     ρ = 0.102


### Transformer-Based Valence Prediction (RobBERT v2)

In addition to character-based models, this project extends the analysis by incorporating contextual representations from a transformer-based model, specifically RobBERT v2 (https://huggingface.co/pdelobelle/robbert-v2-dutch-base).

Following the same evaluation pipeline, Dutch words (Speed & Brysbaert, 2024) and pseudowords (Gatti et al.) were encoded using RobBERT embeddings extracted from layer 0, prior to the integration of positional information. For strings consisting of multiple tokens, token embeddings were averaged to obtain a single representation per string.

A multiple regression model was trained on the valence ratings of Dutch words and subsequently applied to the pseudowords. Model performance was evaluated using the Spearman correlation coefficient between observed and predicted valence scores.

Due to the computational cost of embedding thousands of strings, embeddings were generated incrementally and stored for reuse once verified. During development, smaller subsets of words were used to validate correctness before scaling to the full dataset.
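
The store-and-reuse step could look roughly like the following. This is a minimal sketch that assumes a pickle file on disk; the helper `load_or_compute`, the file name, and the stand-in `fake_embed` function are hypothetical and do not reproduce the code used for this project.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute_fn):
    """Return the pickled object at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute_fn()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result

# Toy stand-in for an expensive embedding call (hypothetical values).
calls = []
def fake_embed():
    calls.append(1)
    return {"miauwen": [0.1, 0.2], "lixtheless": [0.3, 0.4]}

cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_compute(cache_path, fake_embed)   # computes and writes the file
second = load_or_compute(cache_path, fake_embed)  # reads the file, no recomputation
print(first == second, len(calls))  # True 1
```

In a Colab session the cache path would typically point to mounted Drive storage so that the file survives a runtime restart.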

In [114]:
# load and instantiate the right model

tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")


loading file vocab.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/tokenizer_config.json
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--pdelobelle--robbert-v2-dutch-base/snapshots/271b8bf12b7e429434ce953efb432e8373e84453/tokenizer.json
loading file chat_template.jinja from c

In [115]:
# encode the words and pseudowords using RobBERT v2.

def chunks(lst, n):
    """Chunks a list into equal chunks containing n elements. Returns a list of lists."""
    chunked = []
    for i in range(0, len(lst), n):
        chunked.append(lst[i : i + n])
    return chunked

def embed_layer0(texts, batch_size=64):
    embs = []
    for batch in chunks(texts, batch_size):
        enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            lookup = model.roberta.embeddings.word_embeddings(enc["input_ids"])
        core = lookup[:, 1:-1, :]
        embs.append(core.mean(dim=1))
    return torch.cat(embs, dim=0)

X_rob_dutch = embed_layer0(dutch_words,      batch_size=128)
X_rob_pws   = embed_layer0(pseudowords_list, batch_size=128)

example_vecs = embed_layer0(["miauwen", "lixtheless"], batch_size=2)
print("First 20 values of 'miauwen':  ", example_vecs[0,:20].tolist())
print("First 20 values of 'lixtheless':", example_vecs[1,:20].tolist())


First 20 values of 'miauwen':   [0.01574459858238697, -0.050622452050447464, -0.013598739169538021, 0.008594965562224388, -0.025512464344501495, 0.05335233733057976, 0.07680433243513107, -0.0515216588973999, 0.05702745541930199, 0.015662498772144318, -0.006799475289881229, -0.04346112906932831, 0.006691428832709789, 0.020613234490156174, -0.011536190286278725, 0.06474956125020981, 0.010083496570587158, -0.00346448365598917, 0.024953659623861313, -0.016500618308782578]
First 20 values of 'lixtheless': [-0.018898235633969307, 0.06503724306821823, -0.06651163846254349, 0.04543512314558029, -0.002323267050087452, 0.011083191260695457, 0.00870988517999649, -0.031088024377822876, 0.008913702331483364, -0.02554977312684059, -0.009220123291015625, -0.011074005626142025, -0.07307430356740952, 0.05849800258874893, -0.004880142398178577, -0.03039686381816864, 0.02213442325592041, -0.01936710998415947, 0.07086442410945892, -0.031020576134324074]
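
One caveat about the batched average above: with `padding=True`, shorter strings in a batch are padded to the length of the longest one, so the slice `[:, 1:-1, :]` can still contain the end-of-sequence token and padding positions for those strings. The sketch below uses toy NumPy arrays rather than the real model to show how a mask-weighted mean differs from the plain slice. In the actual pipeline such a mask could presumably be built from `enc["attention_mask"]` together with a special-tokens mask (for example by passing `return_special_tokens_mask=True` to the tokenizer); this is an assumption and was not used to produce the results reported here.

```python
import numpy as np

def masked_mean(token_vectors, keep_mask):
    """Average token vectors over positions where keep_mask is 1.

    token_vectors: (batch, seq_len, dim); keep_mask: (batch, seq_len).
    """
    mask = keep_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 strings, 5 positions, 2 dimensions.
vecs = np.array([
    [[9, 9], [1, 1], [3, 3], [5, 5], [9, 9]],  # <s> tok tok tok </s>
    [[9, 9], [2, 2], [9, 9], [7, 7], [7, 7]],  # <s> tok </s> <pad> <pad>
], dtype=float)
keep = np.array([
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
])

print(masked_mean(vecs, keep))        # rows [3, 3] and [2, 2]
print(vecs[:, 1:-1, :].mean(axis=1))  # rows [3, 3] and [6, 6]: </s> and <pad> leak into row 2
```

Switching to a masked mean would change the embeddings of every string shorter than the longest one in its batch, so the regression results would need to be recomputed.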


In [116]:
# train regression model on word valence estimates from Speed and Brysbaert (2024)

rob_reg = LinearRegression().fit(X_rob_dutch, y)

print("RobBERT regression R² on Dutch valence:", rob_reg.score(X_rob_dutch, y))

RobBERT regression R² on Dutch valence: 0.29015233642237737


In [119]:
# apply the trained model to predict the valence of pseudowords from Gatti et al (2024).
# Then apply the same model back onto the training set to see how well it predicts the valence of words in Speed and Brysbaert (2024).

y_pw_pred_rob    = rob_reg.predict(X_rob_pws.numpy())
y_train_pred_rob = rob_reg.predict(X_rob_dutch.numpy())

print("Predicted valence for pseudowords:", y_pw_pred_rob)
print("Predicted valence for Dutch words:", y_train_pred_rob)



Predicted valence for pseudowords: [2.9676025 2.8038704 2.6597714 ... 3.0554996 2.9347763 2.6780963]
Predicted valence for Dutch words: [3.2607572 3.0649645 3.0775404 ... 2.7749772 2.9288225 3.0169504]


In [123]:
# compute the Spearman correlation coefficients between true valence and predicted valence for
# - words from Speed and Brysbaert (2024)
# - pseudowords from Gatti and colleagues (2024)
# show the correlation coefficient

r_rob_train, _ = spearmanr(y, y_train_pred_rob)
r_rob_pws, _  = spearmanr(y_pseudo_true, y_pw_pred_rob)

print(f"Spearman correlation of Dutch words:    ρ = {r_rob_train:.3f}")
print(f"Spearman correlation of pseudowords:    ρ = {r_rob_pws:.3f}")


Spearman correlation of Dutch words:    ρ = 0.515
Spearman correlation of pseudowords:    ρ = 0.169




### Model Performance Comparison

The performance of each featurization was analysed by comparing:
- the performance of the same model between the training and test set,
- the performance of different models on the training set,
- the performance of different models on the test set.

In the training set, the unigram model explains almost no variance (R² = 0.0095, ρ = 0.090), yet it generalizes best to pseudowords (ρ = 0.260). Bigrams capture more orthographic structure (R² = 0.1160, ρ = 0.321), but drop to ρ = 0.101 on novel strings. FastText subword embeddings dominate in-sample (R² = 0.5200, ρ = 0.724), but collapse to ρ = 0.102 out-of-sample. RobBERT layer-0 embeddings sit in between (R² = 0.2902, ρ = 0.515), with moderate generalization (ρ = 0.169).

Across models on training words, performance ranks as follows: 
1. fastText  
2. RobBERT  
3. bigrams  
4. unigrams  

On pseudowords, the ranking changes:
1. unigrams  
2. RobBERT  
3. fastText & bigrams  

These results suggest that richer subword representations fit real-word valence far better in-sample but transfer poorly to novel letter strings, which is consistent with overfitting to lexical information, whereas simpler letter-frequency cues transfer more robustly.
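
Both rankings, and the gap between training words and pseudowords, can be recomputed directly from the Spearman coefficients printed in the cells above. A small sketch; the dictionary layout and the `ranking` helper are illustrative only.

```python
# Spearman correlations copied from the cell outputs above.
results = {
    "unigrams": {"train": 0.090, "pseudo": 0.260},
    "bigrams":  {"train": 0.321, "pseudo": 0.101},
    "fastText": {"train": 0.724, "pseudo": 0.102},
    "RobBERT":  {"train": 0.515, "pseudo": 0.169},
}

def ranking(split):
    """Model names sorted from highest to lowest correlation on `split`."""
    return sorted(results, key=lambda m: results[m][split], reverse=True)

print(ranking("train"))   # ['fastText', 'RobBERT', 'bigrams', 'unigrams']
print(ranking("pseudo"))  # ['unigrams', 'RobBERT', 'fastText', 'bigrams']

# Positive gap: the model does better on training words than on pseudowords.
for model, rho in results.items():
    gap = rho["train"] - rho["pseudo"]
    print(f"{model:9s} train - pseudo = {gap:+.3f}")
```

The unigram model is the only one with a negative gap, that is, a higher correlation on pseudowords than on the words it was trained on.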


### Cross-Linguistic Comparison with English Results

The overall pattern observed in Dutch closely mirrors the English results reported by Gatti et al. (2024). In English, a linear model on single letters achieved only r = .11; adding bigrams raised this to r = .33, and a full fastText-enriched model reached r = .80.

In Dutch, a similar progression is observed: very low correlation for unigrams (ρ = .09), a substantial increase for bigrams (ρ = .32), and the highest performance for fastText (ρ = .72).

Thus, in both languages:
1. Letter-only features capture minimal valence.
2. Orthographic context (bigrams) yields a medium effect.
3. Subword-informed embeddings provide the largest improvement.

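The main quantitative difference is that all Dutch correlations are slightly lower than their English counterparts, particularly for fastText (ρ = .72 vs. r = .80), suggesting somewhat weaker predictive regularity in Dutch valence norms. Note that the Dutch values are Spearman coefficients whereas the English values are given as r, so the comparison is approximate.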

### Effect of N-gram Size in fastText

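FastText relies on subword n-grams. The library default range is 3 to 6 characters, although the documentation for the pre-trained Common Crawl vectors used here reports n-grams of length 5. These subword units enable strong in-sample performance by capturing meaningful orthographic structure. However, such representations do not generalize as effectively to pseudowords.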

Using shorter n-grams (e.g., 3–4 characters) could improve generalization to novel, made-up words by focusing on smaller orthographic units, although this might slightly reduce fit on real words.

Conversely, extending the n-gram range (e.g., up to 8 characters) could further improve modeling of real Dutch words but may increase overfitting and worsen generalization to unseen strings.

Given the observed drop in performance on pseudowords, slightly shorter n-grams could potentially yield a better balance between in-sample accuracy and generalization.

### Expected Performance in Finnish

A further question is whether training the same models (unigrams, bigrams, fastText, and transformer-based embeddings) on valence ratings for Finnish words would yield a similar pattern of results. Finnish uses the same alphabet as English but is not an Indo-European language.

Training the same models on Finnish valence ratings would likely yield a similar ranking pattern: fastText performing best, followed by transformer-based embeddings, then bigrams, and finally unigrams.

However, the performance gaps between models may be larger in Finnish. Finnish words are typically longer and morphologically complex, which could benefit subword-aware models like fastText. At the same time, such complexity may further challenge simple letter-based models.

Transformer models may handle morphological variation more effectively due to contextual representation learning, although a Finnish counterpart of the Dutch RobBERT model would be needed. Bigrams would likely struggle with highly inflected forms, and unigrams would remain the weakest representation.

Overall, the ranking may remain stable, but differences between models would likely become more pronounced.



To further examine the orthographic similarity between pseudowords and existing Dutch vocabulary, the average Levenshtein Distance (aLD) was computed for each pseudoword, defined as its mean edit distance to its 20 nearest Dutch neighbours (the 20 Dutch words at the smallest edit distance).

The Dutch prevalence lexicon used earlier for filtering valid words was employed to retrieve the nearest lexical neighbors. For each pseudoword, distances to all valid Dutch words were computed, the 20 smallest distances were selected, and their average was calculated.
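
The procedure can be illustrated on a very small scale. The sketch below implements the edit distance with standard dynamic programming and averages the k smallest distances; the five-word lexicon, the target string, and k = 3 are chosen purely for illustration, whereas the notebook uses the `Levenshtein` package, the full prevalence lexicon, and k = 20.

```python
import heapq

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def average_ld(target, lexicon, k=20):
    """Mean edit distance from `target` to its k nearest words in `lexicon`."""
    nearest = heapq.nsmallest(k, (levenshtein(target, w) for w in lexicon))
    return sum(nearest) / len(nearest)

toy_lexicon = ["pimpen", "pompen", "lampen", "kip", "pimpelen"]  # illustrative mini-lexicon
print(average_ld("pompan", toy_lexicon, k=3))  # 2.0 (nearest distances 1, 2, 3)
```

`heapq.nsmallest` avoids sorting the full list of distances, which matters little here but helps when every pseudoword is compared against tens of thousands of words.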


In [124]:
# compute the average Levenshtein distance from each pseudoword to the words used to filter out pseudowords.
# Show the aLD estimate for the pseudowords 'nedukes', 'pewbin', and 'vibcines'

targets = ['nedukes', 'pewbin', 'vibcines']

for pw in targets:
    dists = [Levenshtein.distance(pw, w) for w in valid_words]
    nearest20 = sorted(dists)[:20]
    aLD = sum(nearest20) / 20
    print(f"{pw}: aLD = {aLD:.3f}")


nedukes: aLD = 2.900
pewbin: aLD = 2.950
vibcines: aLD = 3.550


In [125]:
# record the number of tokens in which RobBERT divides each pseudoword
# show the number of tokens for the pseudowords 'yuxwas', 'skibfy', and 'errords'

# Use a separate name so the pseudoword column loaded earlier is not overwritten
token_examples = ["yuxwas", "skibfy", "errords"]

for pw in token_examples:
    token_ids = tokenizer.encode(pw, add_special_tokens=False)
    tokens    = tokenizer.convert_ids_to_tokens(token_ids)
    print(f"{pw:8s} → {len(tokens)} tokens: {tokens}")

yuxwas   → 3 tokens: ['y', 'ux', 'was']
skibfy   → 4 tokens: ['sk', 'ib', 'f', 'y']
errords  → 3 tokens: ['er', 'ror', 'ds']


In [126]:
# compute the residuals from all four regression models fitted before
y_true = filtered_gatti["Value"].values

y_uni = y_pseudo_pred_uni[: len(y_true)]
y_bi  = y_pseudo_pred_bi[: len(y_true)]
y_ft  = y_pw_pred_ft[:   len(y_true)]
y_rob = y_pw_pred_rob[:  len(y_true)]

res_uni = y_true - y_uni
res_bi  = y_true - y_bi
res_ft  = y_true - y_ft
res_rob = y_true - y_rob
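
One point to keep in mind when reading these residuals: judging from the outputs printed earlier, `Value` in the Gatti et al. file lies roughly between 0 and 1 (for example 0.45), while the Dutch-trained models predict values near 3. A raw difference therefore combines a constant scale offset with the relative spread of the two variables. The sketch below, which uses made-up numbers and is not the method applied in this notebook, shows one possible alternative: standardising both variables before subtracting.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardise a list of numbers to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy numbers (hypothetical): ratings on a 0-1 scale, predictions near 3.
y_true_toy = [0.45, 0.43, 0.53, 0.60, 0.54]
y_pred_toy = [2.92, 2.93, 3.00, 2.84, 2.79]

raw_res = [t - p for t, p in zip(y_true_toy, y_pred_toy)]
std_res = [t - p for t, p in zip(zscores(y_true_toy), zscores(y_pred_toy))]

print([round(r, 2) for r in raw_res])  # all strongly negative because of the scale offset
print([round(r, 2) for r in std_res])  # centred on zero
```

Whether standardised residuals, or residuals from regressing the ratings on the predictions, would change the correlations reported below has not been checked here.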

In [127]:
# compute the Pearson's correlation between residuals and average LD for all models,
# as well as the correlation between RobBERT v2 residuals and the number of tokens in which each pseudoword
# is encoded by the RobBERT v2 model.
# show all correlation coefficients

aLD_list = []
for pw in pseudowords_list:
    dists = [Levenshtein.distance(pw, w) for w in valid_words]
    nearest20 = sorted(dists)[:20]
    aLD_list.append(sum(nearest20) / 20)
aLD_array = np.array(aLD_list)

token_counts_list = [
    len(tokenizer.encode(pw, add_special_tokens=False))
    for pw in pseudowords_list
]
token_counts = np.array(token_counts_list)


for name, res in zip(
    ["Unigram", "Bigram", "fastText", "RobBERT"],
    [res_uni, res_bi, res_ft, res_rob]
):
    r, p = pearsonr(res, aLD_array)
    print(f"{name:8s} residual vs aLD → r = {r:.3f}, p = {p:.3f}")
r_tok, p_tok = pearsonr(res_rob, token_counts)
print(f"RobBERT residual vs tokens → r = {r_tok:.3f}, p = {p_tok:.3f}")

Unigram  residual vs aLD → r = -0.117, p = 0.000
Bigram   residual vs aLD → r = -0.109, p = 0.000
fastText residual vs aLD → r = -0.094, p = 0.000
RobBERT  residual vs aLD → r = -0.017, p = 0.509
RobBERT residual vs tokens → r = 0.120, p = 0.000


## Relationship Between Model Errors, Edit Distance, and Tokenization

The unigram, bigram, and fastText models each show a significant negative correlation between their residuals and average Levenshtein Distance (aLD) (r = −.117, −.109, and −.094, respectively; p < .001). Because residuals are computed as observed minus predicted valence, this indicates that the further a pseudoword is from its nearest Dutch neighbours, the more these models tend to overestimate its valence.

RobBERT v2, however, shows no meaningful relationship between residuals and aLD (r = −.017; p = .51), suggesting that its prediction errors are not primarily driven by simple orthographic unfamiliarity.

Within RobBERT, residuals correlate positively with the number of subword tokens (r = .120; p < .001). This suggests that pseudowords segmented into more subword units tend to have their valence underestimated.