# A2: Vector Semantics

By Nikolai Ilinykh, Mehdi Ghanimifard, Wafia Adouane and Simon Dobnik. Updated in 2025 by Ricardo Muñoz Sánchez

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.

In this lab we will look at how to build distributional semantic models from corpora and use the semantic similarity captured by these models to perform semantic tasks. We will also examine how well different vector composition functions approximate the semantic similarity of phrases when compared to human judgements.

This lab uses code from a file called `dist_erk.py` which contains functions similar to those shown in the lecture. You can use either set of functions to solve these tasks.

In [None]:
# The code for dist_erk.py uses both spaCy and NLTK, so make sure to have them installed!
# Our code also uses SciPy and scikit-learn, so you'll need to install them as well.
# If you're unsure how to do this, check out these websites:
### https://scipy.org/beginner-install/
### https://scikit-learn.org/stable/install.html
### https://spacy.io/usage
### https://www.nltk.org/install.html


# We also need to make sure we have the necessary models and datasets for spaCy
import spacy
spacy.cli.download('en_core_web_sm')
spacy.cli.download('en_core_web_lg')
spacy.cli.download('en_core_web_trf')

# You only need to run this cell once
# You *need* to restart the kernel after downloading the model!

In [3]:
# the following command simply imports all the methods from the dist_erk file
from dist_erk import *

## 1. Loading a corpus

To train a distributional model, we first need a sufficiently large collection of texts which contain different words used frequently enough in different contexts. Here we will use a section of the Wikipedia corpus `wikipedia.txt` stored in `wikipedia.zip`. This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/).

When unpacked, the file is 151 MB. If you are using the MLT servers, store it in a temporary folder outside your home directory and adjust the `corpus_dir` path below. It may already exist in `/srv/data/computational-semantics/`.
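If you need to unpack the archive yourself, something along the following lines may help. This is only a sketch: the helper name `unpack_corpus` is made up for illustration, and the internal layout of `wikipedia.zip` is an assumption, so check the extracted folder before pointing `corpus_dir` at it.

```python
import tempfile
import zipfile
from pathlib import Path

def unpack_corpus(zip_path, target_dir=None):
    """Extract zip_path into target_dir (a fresh temporary folder by default).

    Hypothetical helper for illustration; it is not part of dist_erk.py.
    """
    if target_dir is None:
        # A temporary folder keeps the large corpus out of your home directory
        target_dir = tempfile.mkdtemp(prefix='wikipedia_')
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target

# Possible usage (assumes wikipedia.zip is in the current folder):
# corpus_dir = str(unpack_corpus('wikipedia.zip')) + '/'
```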

In [4]:
# corpus_dir = './wikipedia/'
corpus_dir = '/srv/data/computational-semantics/assignments/wikipedia/'


## 2. Building a model

Now you are ready to build the model.  
Using the methods from the code imported above, build three word matrices with 1000 dimensions as follows:  

(i) with raw counts (saved to a variable `space_1k`);  
(ii) with PPMI (`ppmispace_1k`);  
(iii) with dimensions reduced by SVD (`svdspace_1k`).  
For the latter use `svddim=5`. **[5 marks]**
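To make the transformations in (ii) and (iii) concrete, here is a toy sketch of PPMI weighting and truncated SVD on a tiny count matrix. It is not the implementation in `dist_erk.py`: the function names `ppmi` and `truncated_svd`, the use of a base-2 logarithm, and the toy counts are all assumptions made for illustration.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information of a word-by-context count matrix."""
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()               # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)      # target word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)      # context word marginals
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts contribute nothing
    return np.maximum(pmi, 0.0)                # keep positive associations only

def truncated_svd(matrix, k):
    """Project the rows of matrix onto its k strongest singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Toy counts: 3 target words (rows) by 3 context words (columns)
toy_counts = np.array([[4, 0, 1],
                       [0, 3, 1],
                       [1, 1, 6]])
toy_ppmi = ppmi(toy_counts)
toy_svd = truncated_svd(toy_ppmi, k=2)
print(toy_ppmi.round(2))
print(toy_svd.shape)
```

The reduced matrix has one row per target word and `k` columns, which is the role that `svddim` plays in the exercise.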

Your task is to replace `...` with function calls to functions from `dist_erk.py` which are similar to functions shown during the lecture.

Do not despair if the code takes a bit long to run!
It took me about 9 minutes for the cell below.

In [None]:
import numpy as np

numdims = 1000
svddim = 5

# Which words to use as targets and context words?
# We need to count the words and keep only the N most frequent ones
# Which function would you use here with which variable?
ktw = do_word_count(corpus_dir, numdims)

wi = make_word_index(ktw) # word index
words_in_order = sorted(wi.keys(), key=lambda w:wi[w]) # sorted words

# Create different spaces (the original matrix space, the ppmi space, the svd space)
# Which functions with which arguments would you use here?
space_1k = make_space(corpus_dir, wi, numdims)


ppmispace_1k = ppmi_transform(space_1k, wi)


svdspace_1k = svd_transform(ppmispace_1k, numdims, svddim)


reading file wikipedia.txt
create count matrices
reading file wikipedia.txt


1145485it [01:30, 12678.93it/s]


ppmi transform
svd transform
done.


In [6]:
# now, to test the space, you can print vector representation for some words
print('house:', space_1k['house'])

house: [2551 3714 3104  567  962  627  443  185  311  189  131   28   93  169
   81  125  151  408  194   89   79   29  217  184   62   15   31   70
   10    1   41   21    1   31   37    1   30    5   25    7    3   20
   11    1   32   36    2    5   65    4    0   46    8   18   28    0
   20    7    8   16   10   40    0  175   10    2    7   19    1  174
   11    3    1    6    0    0    0   10    9   11    7   24    4    4
   14   23   58    7    0   10    2    3   10    6   18    6   13    3
   22    0    3    5    3    7   14    3   40   20   19   15    6    8
   23    4    5    1   19    0    3    1    0   14    0   14   53    7
    7   11    6    5    5    4   12    6   53    1    1  433    4    0
    5    7    7   12    1    1    3    4   17    8   16    1    2   31
    1   12   14    1   44    6   14    9   38    7    2    6    8    1
   10    6   10    1    9    7    9    4    3    9    0   11    3    2
    0    2   11   37    2    0    2    1    5    9   10   16    4    6

The Oxford Advanced Dictionary has 185,000 words, so a vocabulary of 1,000 words is not representative. We therefore trained a model with 10,000 words and reduced it to 50 dimensions with truncated SVD. All matrices are available in the `pretrained` folder of the `wikipedia.zip` file. These are `ktw_wikipediaktw.npy`, `raw_wikipediaktw.npy`, `ppmi_wikipediaktw.npy`, `svd50_wikipedia10k.npy`. Make sure they are in your path, as we load them below.
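The loading cell passes `allow_pickle=True` and calls `.tolist()`, which suggests that the spaces were saved as Python dictionaries. When NumPy saves a dictionary it wraps it in a zero-dimensional object array, and `.tolist()` unwraps it again. The round trip below illustrates this with a made-up toy space and a temporary file; it says nothing about how the course files were actually produced.

```python
import os
import tempfile
import numpy as np

# A made-up toy space: word -> vector
toy_space = {'house': np.array([1, 2, 3])}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_space.npy')
    np.save(path, toy_space, allow_pickle=True)   # dict is wrapped in a 0-d object array
    loaded = np.load(path, allow_pickle=True)
    restored = loaded.tolist()                    # unwrap back into a dict

print(loaded.shape, loaded.dtype)
print(type(restored), restored['house'])
```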

In [None]:
import numpy as np

numdims = 10000
svddim = 50

ktw_10k       = np.load(f'{corpus_dir}/pretrained/ktw_wikipediaktw.npy', allow_pickle=True)
space_10k     = np.load(f'{corpus_dir}/pretrained/raw_wikipediaktw.npy', allow_pickle=True).tolist()
ppmispace_10k = np.load(f'{corpus_dir}/pretrained/ppmi_wikipediaktw.npy', allow_pickle=True).tolist()
svdspace_10k  = np.load(f'{corpus_dir}/pretrained/svd50_wikipedia10k.npy', allow_pickle=True).tolist()



Please wait...
Done.


In [None]:
# Testing semantic space
print('house:', space_10k['house'])

house: [2554 3774 3105 ...    0    0    0]


## 3. Testing semantic similarity

The file `similarity_judgements.txt` contains 7,576 pairs of words and their lexical and visual similarities (based on the pictures) collected through crowd-sourcing using Mechanical Turk as described in [1]. The scores range from 1 (highly dissimilar) to 5 (highly similar). Note: this is a different dataset from the phrase similarity dataset we discussed during the lecture [2]. You can find more details about how they were collected in the papers.

The following code will transform similarity scores into a Python-friendly format:

In [None]:
word_pairs = [] # Test suite word pairs
semantic_similarity = [] 
visual_similarity = []
test_vocab = set()

for index, line in enumerate(open('similarity_judgements.txt')):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # Checks if both words from each pair exist in the word matrix.
        if w1 in ktw_10k and w2 in ktw_10k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))
        
print('number of available words to test:', len(test_vocab-(test_vocab-set(ktw_10k))))
print('number of available word pairs to test:', len(word_pairs))
#list(zip(word_pairs, visual_similarity, semantic_similarity))

number of available words to test: 155
number of available word pairs to test: 774


We are going to test how the cosine similarity between vectors of each of the three spaces (normal space, ppmi, svd) compares with the human similarity judgements for the words in the similarity dataset. Which of the three spaces best approximates human judgements?
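As a reminder of what cosine similarity measures, here is a minimal NumPy sketch. The function name `cosine_similarity` is hypothetical and works on two raw vectors; the `cosine` function in `dist_erk.py` instead takes two words and a space, and its handling of edge cases may differ.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either has zero length)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0:
        return 0.0
    return float(np.dot(u, v) / norm)

print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))   # same direction, different length
```

Because the measure depends only on direction, a frequent word and a rare word can still be highly similar if their context distributions have the same shape.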

To compare several sets of scores, we can use [the Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient), which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html). The values of the Spearman correlation coefficient range from -1 to 1, where 0 indicates no correlation, 1 a perfect positive correlation and -1 a perfect negative correlation. Hence, the greater the number, the better the similarity scores align. The p-value tells us whether the coefficient is statistically significant. For this to be the case, it must be less than $0.05$.
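A small illustration of how to read the coefficient, using made-up numbers: Spearman's rho only looks at the ordering of the values, so a monotonic but non-linear relationship still gives a perfect score, while Pearson's coefficient does not.

```python
import numpy as np
from scipy import stats

# Made-up data: y grows monotonically with x, but not linearly
x = np.array([1, 2, 3, 4, 5, 6])
y = x ** 3

rho, pval = stats.spearmanr(x, y)
r, _ = stats.pearsonr(x, y)
rho_reversed, _ = stats.spearmanr(x, -y)

print(f"Spearman rho            = {rho:.4f}")
print(f"Pearson r               = {r:.4f}")
print(f"Spearman rho (reversed) = {rho_reversed:.4f}")
```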

Here is how you can calculate the Spearman correlation coefficient between the scores of visual similarity and semantic similarity of the available words in the test suite:

In [11]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

Visual Similarity vs. Semantic Similarity:
rho     = 0.7122
p-value = 0.0000


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all three matrices. **[6 marks]**

In [12]:
raw_similarities  = [cosine(w1, w2, space_10k) for w1, w2 in word_pairs]
ppmi_similarities = [cosine(w1, w2, ppmispace_10k) for w1, w2 in word_pairs]
svd_similarities  = [cosine(w1, w2, svdspace_10k) for w1, w2 in word_pairs]

Calculate correlation coefficients between the lists of similarity scores and the real semantic similarity scores from the experiment. Which model's scores correlate best with them? Is this expected? **[6 marks]**

In [14]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, raw_similarities)
print(f"Semantic Similarity vs. Raw Similarity:\nrho = {rho:.4f}\np-value = {pval:.4f}\n\n")

rho, pval = stats.spearmanr(semantic_similarity, ppmi_similarities)
print(f"Semantic Similarity vs. PPMI Similarity:\nrho = {rho:.4f}\np-value = {pval:.4f}\n\n")

rho, pval = stats.spearmanr(semantic_similarity, svd_similarities)
print(f"Semantic Similarity vs. SVD Similarity:\nrho = {rho:.4f}\np-value = {pval:.4f}\n\n")


Semantic Similarity vs. Raw Similarity:
rho = 0.1522
p-value = 0.0000


Semantic Similarity vs. PPMI Similarity:
rho = 0.4547
p-value = 0.0000


Semantic Similarity vs. SVD Similarity:
rho = 0.4232
p-value = 0.0000




**Your answer should go here:**

The **rho** values above are the Spearman rank correlation coefficients between `raw_similarities`, `ppmi_similarities` and `svd_similarities` (raw counts, PPMI, SVD) and the human `semantic_similarity` scores. **PPMI similarity** has the highest **rho**, **0.4547**, indicating the strongest correlation of the three. **SVD similarity** follows closely with a **rho** of **0.4232**; both are moderate correlations. The raw counts reach only **0.1522**.

PPMI similarity is clearly higher than the raw count method because PPMI reweights the co-occurrence counts: contexts that co-occur with a target word more often than expected by chance are emphasized, while very frequent, uninformative contexts are discounted. The resulting vectors are more likely to reflect true semantic relationships. On the other hand, **SVD similarity** is slightly lower than PPMI similarity, which might be due to information lost when SVD reduces the dimensionality of the PPMI space. SVD keeps the most important latent dimensions, but some fine-grained contextual distinctions are lost, which can lower the correlation with human judgements.

The **p-value** for all correlations is less than **0.05**, indicating that the observed correlations are statistically significant.


We can also calculate correlation coefficients between the lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model best correlates with them? How do the correlation coefficients compare with those from the previous comparison, and can you speculate why we get such results? **[7 marks]**

In [15]:
from scipy import stats

rho, pval = stats.spearmanr(visual_similarity, raw_similarities)
print(f"Visual Similarity vs. Raw Similarity:\nrho = {rho:.4f}\np-value = {pval:.4f}\n\n")

rho, pval = stats.spearmanr(visual_similarity, ppmi_similarities)
print(f"Visual Similarity vs. PPMI Similarity:\nrho = {rho:.4f}\np-value = {pval:.4f}\n\n")

rho, pval = stats.spearmanr(visual_similarity, svd_similarities)
print(f"Visual Similarity vs. SVD Similarity:\nrho = {rho:.4f}\np-value = {pval:.4f}\n\n")

Visual Similarity vs. Raw Similarity:
rho = 0.1212
p-value = 0.0007


Visual Similarity vs. PPMI Similarity:
rho = 0.3838
p-value = 0.0000


Visual Similarity vs. SVD Similarity:
rho = 0.3097
p-value = 0.0000




**Your answer should go here:**

The **PPMI model** again shows the strongest correlation (**0.3838**), clearly outperforming the raw counts (**0.1212**), followed by **SVD** (**0.3097**). This is the same ordering that we observed for **semantic similarity**. However, the correlations between the **cosine similarity scores** and the **visual similarity scores** are lower overall than those with the **semantic similarity scores**. One likely reason is that the vector spaces are built from text alone, so they capture how words are used rather than how their referents look. The human visual and semantic similarity scores correlate with a rho of only **0.7122**, which indicates that lexical and visual similarity are related but distinct. Visual similarity judgements may also involve a degree of subjectivity, which could contribute to the lower correlation.


## 4. Operations on similarities

We can perform mathematical operations on vectors to derive meaning predictions.

For example, we can perform `king - man` and add the resulting vector to `woman` and we hope to get the vector for `queen`. What would be the result of `stockholm - sweden + denmark`? Why? **[3 marks]**

If you want to learn more about vector differences between words (and words in analogy relations), check this paper [4].

**Your answer should go here:**

In the example **king - man + woman = queen**, we use vector arithmetic to capture the relationship between words. Subtracting "man" from "king" removes the part of the "king" vector that represents "male", and adding "woman" should bring us close to the vector for "queen", because "queen" is the female counterpart of "king".

For the analogy **Stockholm - Sweden + Denmark**, we expect a vector close to **Copenhagen**, the capital of Denmark. **Stockholm** is the capital of **Sweden**, so subtracting "Sweden" from "Stockholm" isolates the "capital city" part of Stockholm's vector. Adding "Denmark" introduces the country, so the result is expected to be close to the capital of Denmark, **Copenhagen**.
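The intuition can be checked on a hand-made toy space. The two dimensions below are invented for illustration (roughly "is a capital" and "which country") and do not come from the Wikipedia model; real spaces are far noisier, so the expected word is not guaranteed to come out on top.

```python
import numpy as np

# Invented 2-d vectors: [capital-ness, country axis]
toy_space = {
    'stockholm':  np.array([1.0,  1.0]),
    'sweden':     np.array([0.0,  1.0]),
    'copenhagen': np.array([1.0, -1.0]),
    'denmark':    np.array([0.0, -1.0]),
}

def cos(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = toy_space['stockholm'] - toy_space['sweden'] + toy_space['denmark']
ranked = sorted(toy_space, key=lambda w: cos(query, toy_space[w]), reverse=True)
print(ranked)
```

In this idealised space the offset between a capital and its country is the same for both countries, which is exactly the regularity the analogy test hopes to find.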

Here is some code that allows us to calculate such comparisons.

In [37]:
from scipy.spatial import distance

def normalize(vec):
    return vec / veclen(vec)

def find_similar_to(vec1, space):
    # vector similarity function
    #sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.correlation(a, b)
    #sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    #sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)

Here is how you apply this code. Comment on the results you get. **[3 marks]**

In [60]:
short = normalize(svdspace_10k['short'])
light = normalize(svdspace_10k['light'])
long = normalize(svdspace_10k['long'])
heavy = normalize(svdspace_10k['heavy'])

find_similar_to(light - (heavy - long), svdspace_10k)[:10]

[('long', 0.8733111261346902),
 ('above', 0.8259671977311956),
 ('around', 0.8030776291120686),
 ('sun', 0.7692439111243974),
 ('just', 0.767848197477811),
 ('wide', 0.7672574319922534),
 ('each', 0.7665960260861158),
 ('circle', 0.7647746702909335),
 ('length', 0.7601066921319761),
 ('almost', 0.7542351860536627)]

**Your answer should go here:**

The vector arithmetic `light - (heavy - long)` can be simplified to `light - heavy + long`. This operation tries to blend the concepts of **lightness** and **length** while removing **heaviness**.

The results **long** and **length** directly reflect the `+ long` term, possibly because subtracting the antonym in `light - heavy` leaves little of the adjective's meaning behind. They may also relate to **light** in the sense of sunlight, or to light as something with spatial or dimensional properties. "sun" and "circle" plausibly appear because the sun is a source of light and is round. The spatial terms ("above", "around", "wide") describe dimensional attributes, in line with the geometric interpretation of "length". The absence of direct antonyms of "heavy" also shows how hard it is to isolate oppositional relationships through simple arithmetic.

The words **just**, **each** and **almost** have no semantic relation to the operands; they act as qualifiers, determiners or adverbs and often appear alongside dimensional terms in sentences. This shows that vector arithmetic does not perfectly isolate semantic relations: it retrieves words that share fragments of contextual overlap with the query vector, which introduces noise.
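Note that the top hit, `long`, is one of the words used to build the query. A common practice in analogy evaluation is to drop the query words from the ranking before inspecting it. The helper below is a hypothetical sketch of that filtering step; it works on a list of `(word, score)` pairs like the one returned by `find_similar_to`, and the scores in the toy list are made up.

```python
def drop_query_words(ranked, query_words):
    """Remove the words used in the query from a ranked list of (word, score) pairs."""
    excluded = set(query_words)
    return [(word, score) for word, score in ranked if word not in excluded]

# Made-up ranking for illustration
toy_ranked = [('long', 0.87), ('above', 0.83), ('light', 0.80), ('wide', 0.77)]
filtered = drop_query_words(toy_ranked, ['light', 'heavy', 'long'])
print(filtered)
```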


Find 5 similar pairs of pairs of words and test them. Hint: google for `word analogies examples`. You can also construct analogies that are not only lexical but also express other relations such as grammatical relations, e.g. `see, saw, leave, ?`, or analogies that are based on world knowledge as in `questions-words.txt` from the [Google analogy dataset](http://download.tensorflow.org/data/questions-words.txt) described in [3]. Does the resulting vector similarity confirm your expectations? Remember that you can only do this test if the words are contained in our 10,000-word vector space. **[10 marks]**

In [68]:

def print_analogies(target, similar_words, expression=None):
    print(f"Expression:{expression}\n")
    print(f"Target: \033[32m{target}\033[0m\n")
    for word in similar_words:
        if word[0] == target:
            print(f"\033[32m{word}\033[0m")
        else:
            print(word)
    print(f"***************************************************\n\n")

############## 1. Grammatical Category Relationship ##############
# see saw seen
# go went gone
see = normalize(svdspace_10k['see'])   
saw = normalize(svdspace_10k['saw'])
go = normalize(svdspace_10k['go'])
went = normalize(svdspace_10k['went'])

# Analogy computation: saw - see + go --> went
result_verb_tense = saw - see + go

# Find similar words to the result
similar_words = find_similar_to(result_verb_tense, svdspace_10k)[:10]
print_analogies('went', similar_words, expression="saw - see + go --> went")

# ***********
see = normalize(ppmispace_10k['see'])   
saw = normalize(ppmispace_10k['saw'])
go = normalize(ppmispace_10k['go'])
went = normalize(ppmispace_10k['went'])

# Analogy computation: saw - see + go --> went
result_verb_tense = saw - see + go

# Find similar words to the result
print("*** Using PPMI space:")
similar_words = find_similar_to(result_verb_tense, ppmispace_10k)[:10]
print_analogies('went', similar_words, expression="saw - see + go --> went")


############## 2. Spatial Relationship ##############
south = normalize(svdspace_10k['south']) 
north = normalize(svdspace_10k['north'])
inside = normalize(svdspace_10k['inside'])
outside = normalize(svdspace_10k['outside'])

# Analogy computation: south - north + inside --> outside
result_spatial = south - north + inside
# Find similar words to the result
similar_words = find_similar_to(result_spatial, svdspace_10k)[:10]
print_analogies('outside', similar_words, expression="south - north + inside --> outside")

# ***********
# Analogy computation: north - south + outside --> inside
result_spatial = north - south + outside
# Find similar words to the result
similar_words = find_similar_to(result_spatial, svdspace_10k)[:10]
print_analogies('inside', similar_words, expression="north - south + outside --> inside")

############## 3. Comparative Relationship ##############
big = normalize(svdspace_10k['big'])
bigger = normalize(svdspace_10k['bigger'])
light = normalize(svdspace_10k['light'])
lighter = normalize(svdspace_10k['lighter'])

# Analogy computation: bigger - big + light --> lighter
result_comparative = bigger - big + light

# Find similar words to the result
similar_words = find_similar_to(result_comparative, svdspace_10k)[:10]
print_analogies('lighter', similar_words, expression="bigger - big + light --> lighter")

############## 4. City-Country ##############
city1 = normalize(svdspace_10k['athens'])
country1 = normalize(svdspace_10k['greece'])
city2 = normalize(svdspace_10k['beijing'])
country2 = normalize(svdspace_10k['china'])

# Analogy computation: city1 - country1 + country2 --> city2
result_capital = city1 - country1 + country2

# Find similar words to the result
similar_words = find_similar_to(result_capital, svdspace_10k)[:10]
print_analogies('beijing', similar_words, expression="athens - greece + china --> beijing")

#######
city1 = normalize(svdspace_10k['stockholm'])
country1 = normalize(svdspace_10k['sweden'])
city2 = normalize(svdspace_10k['copenhagen'])
country2 = normalize(svdspace_10k['denmark'])

# Analogy computation: city1 - country1 + country2 --> city2
result_capital = city1 - country1 + country2

# Find similar words to the result
similar_words = find_similar_to(result_capital, svdspace_10k)[:20]
print_analogies('copenhagen', similar_words, expression="stockholm - sweden + denmark --> copenhagen")


############## 5. Relationship ##############
king = normalize(svdspace_10k['king'])
man = normalize(svdspace_10k['man'])
queen = normalize(svdspace_10k['queen'])
woman = normalize(svdspace_10k['woman'])

# Analogy computation: king - man + woman --> queen
result_comparative = king - man + woman

# Find similar words to the result
similar_words = find_similar_to(result_comparative, svdspace_10k)[:10]
print_analogies('queen', similar_words, expression="king - man + woman --> queen")

# *********** 
father = normalize(svdspace_10k['father'])
man = normalize(svdspace_10k['man'])
mother = normalize(svdspace_10k['mother'])
woman = normalize(svdspace_10k['woman'])

# Analogy computation: father - man + woman --> mother
result_comparative = father - man + woman

# Find similar words to the result
similar_words = find_similar_to(result_comparative, svdspace_10k)[:10]
print_analogies('mother', similar_words, expression="father - man + woman --> mother")


Expression:saw - see + go --> went

Target: went

('gone', np.float64(0.7727247930029633))
('go', np.float64(0.7686350482087698))
('ahead', np.float64(0.7601887728920765))
('move', np.float64(0.7563034683136987))
('stand', np.float64(0.7553404522830045))
('stay', np.float64(0.7527578152359772))
('throw', np.float64(0.7517022141423654))
('blow', np.float64(0.7498511450771898))
('going', np.float64(0.7452173186091259))
('put', np.float64(0.7431018790922572))
***************************************************


*** Using PPMI space:
Expression:saw - see + go --> went

Target: went

('go', np.float64(0.6037425322598825))
('saw', np.float64(0.589673146628872))
('get', np.float64(0.1779585939527637))
('to', np.float64(0.17562589875377843))
('went', np.float64(0.1753969623869055))
('going', np.float64(0.16931358410259578))
('take', np.float64(0.16441004643269452))
('you', np.float64(0.16370258486326283))
('leave', np.float64(0.1601650333492034))
('would', np.float6

**Your answer should go here:**

In the case of **"saw - see + go --> went"**, using **svdspace_10k**, the resulting words include **gone**, **go**, **move** and **going**, but the expected word **went** does not appear in the top 10. This suggests that **SVD** loses some fine-grained details during the dimensionality reduction process. With the **ppmispace_10k** model, however, the target word **went** does appear in the top 10, which matches the expected result. This suggests that **PPMI** captures detailed relationships better than **SVD**, at the cost of computational efficiency. In the case of **"bigger - big + light --> lighter"**, the **svdspace_10k** model correctly captures the grammatical rule and predicts **lighter**. This shows that **SVD** can handle certain grammatical transformations, such as adjective comparison, effectively in some cases.

In the case of the **spatial relationship "south - north + inside --> outside"**, the **svdspace_10k** model correctly predicts **outside** as the result of the operation. However, in the reverse analogy, **"north - south + outside --> inside"**, the result deviates from expectations and the output consists only of numeric values. The model may be strong at capturing certain relationships but struggle with reversing certain analogies or with relationships that involve more subtle semantic distinctions.

The remaining examples generally align well with the expected results, demonstrating that vector-based models can capture the intended relationships for most of the analogies tested.
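Looking only at the top 10 hides how far down the expected word actually is. One way to quantify this is to look up the position of the target in the full ranking. The function below is a hypothetical helper, not part of `dist_erk.py`; it assumes a list of `(word, score)` pairs sorted from most to least similar, as returned by `find_similar_to`, and the toy ranking is made up.

```python
def target_rank(target, ranked):
    """Return the 1-based position of target in a ranked list of (word, score) pairs.

    Returns None if the target is not in the list at all.
    """
    for position, (word, _score) in enumerate(ranked, start=1):
        if word == target:
            return position
    return None

# Made-up ranking for illustration
toy_ranked = [('gone', 0.77), ('go', 0.76), ('went', 0.70), ('put', 0.65)]
print(target_rank('went', toy_ranked))
print(target_rank('queen', toy_ranked))
```

With the real spaces you would pass the complete output of `find_similar_to` rather than a slice, so that targets outside the top 10 can still be located.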

## 5. Semantic composition and phrase similarity **[20 marks]**

In this task, we are going to examine how the composed vectors of phrases by different semantic composition functions/models introduced in [2] correlate with human judgements of similarity between phrases. We will use the dataset from this paper which is stored in `mitchell_lapata_acl08.txt`. If you are interested in further details about this task, also refer to this paper.

(i) Process the dataset. The dataset contains human judgements of similarity between phrases, recorded one per line. The first column indicates the id of the participant making the judgement (`participant`), the next column is `verb`, followed by `noun` and `landmark`. From these three columns we can construct the phrases that were compared by human informants, namely `verb noun` vs `verb landmark`. The next column, `input`, indicates the similarity score a participant assigned to a pair of such phrases on a scale from 1 to 7, where 1 is the lowest and 7 the highest. The last column, `hilo`, groups the phrases into two sets: phrases where we expect low and phrases where we expect high similarity scores. This is because we want to test our compositional functions on two tasks and examine whether a function discriminates between them. Correlation between scores could also be due to reasons other than semantic similarity, and hence good prediction on both tasks simultaneously shows that a function is truly discriminating the phrases using some semantic criteria.

For extracting information you can use the code from the lecture to start with. How to structure this data is up to you - a dictionary-like format would be a good choice. Remember that each example was judged by several participants and phrases will repeat in the dataset. Therefore, you have to collect all judgments for a particular set of phrases and average them. This will become useful in step (iii).

(ii) Compose the vectors of the extracted word pairs by testing different compositional functions. In the lecture we introduced simple additive, simple multiplicative and combined models (details are described in [2]). Your task is to take a pair of phrases, e.g. the first example in the dataset `stray thought` and `stray roam` and for each phrase compute a composition of the vectors of their words using these functions, using one function per experiment run. For each phrase you will get a single vector. You can encode the words with any vector space introduced earlier (standard space, ppmi or svd) but your code should be structured in a way that it will be easy to switch between them. Finally, take the resulting (composed) vectors of phrase pairs in the dataset and calculate a cosine similarity between them.

(iii) Now you have cosine similarity scores between vectors of phrases, but how do they compare with the average human scores that you calculated from the individual judgements in the `input` column of the dataset for the same phrases? Calculate the Spearman rank correlation coefficient between the two lists of scores for both the `high` and the `low` task.

We use the Spearman rank correlation coefficient (or Spearman's rho) rather than Pearson's correlation coefficient because we cannot compare cosine scores with human judgements directly. Cosine is a continuous measure and human judgements are expressed as ranks. Also, we cannot say whether a difference between 0.28 and 1 in cosine is the same as (or different from) a difference between 6 and 7 in the human scores. The Spearman rank correlation coefficient first turns the scores for all examples within each group into ranks, and then these ranks are correlated (that is, approximated by a linear function).
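The two-step description (rank first, then correlate) can be verified directly. In the sketch below the cosine and human scores are made-up numbers; ranking both lists with `scipy.stats.rankdata` and applying Pearson's coefficient to the ranks gives the same value as `scipy.stats.spearmanr`.

```python
import numpy as np
from scipy import stats

# Made-up scores for five phrase pairs
cosine_scores = np.array([0.28, 0.91, 0.55, 0.10, 0.73])
human_scores = np.array([2.0, 6.5, 5.0, 1.5, 4.0])

# Step 1: replace each score by its rank within its own list
ranks_cos = stats.rankdata(cosine_scores)
ranks_hum = stats.rankdata(human_scores)

# Step 2: Pearson correlation of the ranks
rho_by_hand, _ = stats.pearsonr(ranks_cos, ranks_hum)

# The library function performs both steps at once
rho_scipy, _ = stats.spearmanr(cosine_scores, human_scores)

print(ranks_cos, ranks_hum)
print(f"{rho_by_hand:.4f} {rho_scipy:.4f}")
```

Because only the ranks matter, the different scales of the two lists (0 to 1 versus 1 to 7) play no role in the result.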

In the end you should get a table similar to the one below from the paper. What is the best compositional function from those that you evaluated with your vector spaces and why?

<img src="res.png" alt="drawing" width="500"/>

Note that you might not get results in the same range as those in the paper.
That is OK: a good interpretation of the results and a discussion of why they are sometimes not as good as you would expect is better than the best-performing results with little to no analysis.
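As a starting point for step (ii), the composition functions can be written as small interchangeable functions, so that switching the function or the vector space is a one-argument change. Everything below is a sketch under assumptions: the function names, the `COMPOSERS` dictionary, the default weights of the combined model and the three-dimensional toy vectors are all invented for illustration and are not the settings used in [2].

```python
import numpy as np

def compose_additive(u, v):
    return u + v

def compose_multiplicative(u, v):
    return u * v  # element-wise product

def compose_combined(u, v, alpha=1.0, beta=1.0, gamma=1.0):
    # Weighted mix of the additive and multiplicative terms; weights are placeholders
    return alpha * u + beta * v + gamma * (u * v)

COMPOSERS = {
    'additive': compose_additive,
    'multiplicative': compose_multiplicative,
    'combined': compose_combined,
}

def phrase_similarity(phrase1, phrase2, space, compose):
    """Cosine similarity between the composed vectors of two two-word phrases."""
    p1 = compose(space[phrase1[0]], space[phrase1[1]])
    p2 = compose(space[phrase2[0]], space[phrase2[1]])
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(np.dot(p1, p2) / denom) if denom else 0.0

# Invented toy vectors; in the lab you would pass e.g. ppmispace_10k instead
toy_space = {
    'stray':   np.array([1.0, 2.0, 0.0]),
    'thought': np.array([2.0, 1.0, 1.0]),
    'roam':    np.array([1.0, 3.0, 0.0]),
}

for name, compose in COMPOSERS.items():
    sim = phrase_similarity(('stray', 'thought'), ('stray', 'roam'), toy_space, compose)
    print(f"{name}: {sim:.4f}")
```

Keeping `space` and `compose` as parameters means the same evaluation loop can be run once per combination of vector space and composition function.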


In [79]:
# (i) - Process the data
import pandas as pd
# Load the dataset
data_file = './mitchell_lapata_acl08.txt'
df = pd.read_csv(data_file, sep=' ', header=None, names=["participant", "verb", "noun", "landmark", "input", "hilo"], skiprows=1)

print(f"\nNumber of rows in the dataset: {len(df)}")


unique_combinations = set(zip(df['verb'], df['noun'], df['landmark']))
print(f"Number of unique combinations of 'verb', 'noun', and 'landmark': {len(unique_combinations)}")

print(f"Number of participants: {len(set(df['participant']))}")

unique_words = set(df['verb'].tolist() + df['noun'].tolist() + df['landmark'].tolist())
print(f"Number of unique words: {len(unique_words)}")
counter = 0
words_in_model = []

for word in unique_words:
    if word in ktw_10k:
        words_in_model.append(word)
        counter += 1
print(f"Number of words in pretrained wikipedia 10k model: {counter}")

unique_combinations_inmodel = {}
unique_combinations_all = {}
for comb in unique_combinations:
    unique_combinations_all[comb] = {'input': [], 'hilo': ''}
    # Check if all three words are in the 10k Wikipedia model
    if comb[0] in words_in_model and comb[1] in words_in_model and comb[2] in words_in_model:
        unique_combinations_inmodel[comb] = {'input': [], 'hilo': ''}

    
        
print(f"\033[32mNumber of unique combinations found in the 10k wikipidia model: {len(unique_combinations_inmodel)}\033[0m")
print(f"\033[32mNumber of unique combinations found: {len(unique_combinations_all)}\033[0m")

# Iterate over each row in the dataframe
for _, row in df.iterrows():
    key = (row['verb'], row['noun'], row['landmark'])
    value = {'input': row['input'], 'hilo': row['hilo']}

    unique_combinations_all[key]['input'].append(value['input'])
    unique_combinations_all[key]['hilo'] = value['hilo']
    
    if key in unique_combinations_inmodel:
        unique_combinations_inmodel[key]['input'].append(value['input'])
        if unique_combinations_inmodel[key]['hilo'] == '':
            unique_combinations_inmodel[key]['hilo'] = value['hilo']
        else:
            if unique_combinations_inmodel[key]['hilo'] != value['hilo']:
                print(f"Conflict in hilo for combination {key}: {unique_combinations_inmodel[key]['hilo']} vs {value['hilo']}")

unique_combinations_inmodel_high = {}
unique_combinations_inmodel_low = {}
high_samples_num = 0
low_samples_num = 0
for key in unique_combinations_inmodel:
    input_values = unique_combinations_inmodel[key]['input']
    if input_values:  # Ensure that the list is not empty
        input_mean = sum(input_values) / len(input_values)
        unique_combinations_inmodel[key]['input_mean'] = round(input_mean, 4)
    if unique_combinations_inmodel[key]['hilo'] == 'high':
        unique_combinations_inmodel_high[key] = unique_combinations_inmodel[key]
        high_samples_num += len(input_values)
    elif unique_combinations_inmodel[key]['hilo'] == 'low':
        unique_combinations_inmodel_low[key] = unique_combinations_inmodel[key]
        low_samples_num += len(input_values)
    else:
        print(f"Unexpected hilo value for combination {key}: {unique_combinations_inmodel[key]['hilo']}")

for key in unique_combinations_all:
    input_values = unique_combinations_all[key]['input']
    if input_values:  # Ensure that the list is not empty
        input_mean = sum(input_values) / len(input_values)
        unique_combinations_all[key]['input_mean'] = round(input_mean, 4)
     
        
print(f"{len(unique_combinations_inmodel_high)} pairs of high in 10k wikipidia model(with {high_samples_num} samples): {unique_combinations_inmodel_high}")
print(f"{len(unique_combinations_inmodel_low)} pairs of low in 10k wikipidia model(with {low_samples_num} samples): {unique_combinations_inmodel_low}")



Number of rows in the dataset: 3600
Number of unique combinations of 'verb', 'noun', and 'landmark': 120
Number of participants: 60
Number of unique words: 95
Number of words in pretrained wikipedia 10k model: 58
Number of unique combinations found in the 10k wikipidia model: 8
Number of unique combinations found: 120
4 pairs of high in 10k wikipidia model(with 120 samples): {('boom', 'noise', 'thunder'): {'input': [6, 6, 6, 7, 7, 7, 6, 6, 6, 5, 5, 6, 7, 7, 6, 7, 6, 7, 5, 7, 6, 7, 5, 6, 7, 3], 'hilo': 'high', 'input_mean': 6.1154}, ('bow', 'government', 'submit'): {'input': [6, 6, 3, 7, 7, 5, 3, 2, 5, 5, 6, 7, 6, 6, 7, 6, 3, 7, 4, 6, 3, 2, 7, 7, 7, 7], 'hilo': 'high', 'input_mean': 5.3846}, ('bow', 'company', 'submit'): {'input': [5, 6, 6, 4, 6, 5, 5, 4, 6, 2, 2, 5, 6, 6, 2, 4, 4, 6, 7, 6, 6, 2, 3, 3, 2, 4, 2, 5, 6, 2, 5, 2, 5, 3], 'hilo': 'high', 'input_mean': 4.3235}, ('boom', 'gun', 'thunder'): {'input': [6, 6, 7, 5, 6, 6, 5, 6, 6, 6, 6, 6, 6, 7, 4, 6, 7, 6, 7, 5,

In [80]:
import pandas as pd
from scipy.stats import spearmanr

# Leave-one-out Spearman correlation (upper bound)

df = pd.read_csv(data_file, sep=' ', header=None, names=["participant", "verb", "noun", "landmark", "input", "hilo"], skiprows=1)
# key: (verb, noun, landmark)
pivot = df.pivot_table(index=["verb", "noun", "landmark"], columns="participant", values="input")
rhos, ps = [], []
for participant in pivot.columns:
    self_scores = pivot[participant]
    others_mean = pivot.drop(columns=participant).mean(axis=1)
    rho, p = spearmanr(self_scores, others_mean, nan_policy='omit')
    rhos.append(rho)
    ps.append(p)

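# Average over participants. Note: the mean of the per-participant p-values is only
# a rough summary, not a valid combined significance test.
upper_bound_rho = sum(rhos) / len(rhos)
upper_bound_p = sum(ps) / len(ps)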
print(f"UpperBound Spearman correlation (inter-subject agreement): {upper_bound_rho:.4f} with p-value {upper_bound_p:.4f}")

UpperBound Spearman correlation (inter-subject agreement): 0.6604 with p-value 0.0001


In [81]:
# (ii) - Compose the vectors of the extracted word pairs by testing different compositional functions
# Did the paper point out any specific word vector models to use?
# TODO: Change to additive_composition(vector1, vector2) and multiplicative_composition(vector1, vector2) functions to avoid calculating word vectors multiple times in (iii)
import numpy as np
import spacy
import pandas as pd
from scipy.spatial.distance import cosine

enabled_fasttext_model = True
if enabled_fasttext_model:
    import fasttext.util
    import fasttext
    # fasttext.util.download_model('en', if_exists='ignore')
    ft = fasttext.load_model('/home/gushuota@GU.GU.SE/model/fasttext/cc.en.300.bin')
    # ft = fasttext.load_model('cc.en.300.bin')

# get word vector from model
# Here, we can easily change to other models
def get_word_vector(word):
    if enabled_fasttext_model:
        word_vector = ft.get_word_vector(word)
    else:
        # use the pretrained Wikipedia model loaded earlier; it misses many of the words
        # word_vector = space_10k[word] 
        # word_vector = ppmispace_10k[word]
        word_vector = svdspace_10k[word]
    
    # word_vector = normalize(word_vector)
    return word_vector


# cosine similarity
def cosine_similarity(vec1, vec2):
    similarity = 1 - cosine(vec1, vec2)
    # print(f"Cosine Similarity: {similarity}")
    # NaN never compares equal to anything (not even itself), so test with np.isnan
    if np.isnan(similarity):
        similarity = 0.0
    return similarity

# compute additive composition of word vectors
def additive_composition(vector1, vector2, weight1=1, weight2=1):
    # return vector1*weight1 + vector2*weight2  # Weighted sum of the two vectors
    return normalize(vector1*weight1 + vector2*weight2)


# compute multiplicative composition of word vectors
def multiplicative_composition(vector1, vector2):
    # return vector1 * vector2  # Element-wise multiplication
    return normalize(vector1 * vector2)

# Compute the combined model by combining both additive and multiplicative compositions
def combined_composition(vector1, vector2, alpha=0.5, beta=0.5):
    # Compute additive and multiplicative compositions
    additive_vector = vector1 + vector2
    multiplicative_vector = vector1 * vector2
    # Combine using the weighted sum
    combined_vector = alpha * additive_vector + beta * multiplicative_vector

    # return combined_vector
    return normalize(combined_vector)

# Example usage for the first pair in the dataset
verb, noun, landmark = list(unique_combinations_inmodel_high.keys())[0]
print(f"Testing verb: {verb}, noun: {noun}, landmark: {landmark}")

verb_vector = get_word_vector(verb)
noun_vector = get_word_vector(noun)
landmark_vector = get_word_vector(landmark)

additive_vector = additive_composition(verb_vector, noun_vector)
additive_landmark_vector = additive_composition(landmark_vector, noun_vector)
print("Additive Composition Similarity: ", cosine_similarity(additive_vector, additive_landmark_vector))

multiplicative_vector = multiplicative_composition(verb_vector, noun_vector)
multiplicative_landmark_vector = multiplicative_composition(landmark_vector, noun_vector)
print("Multiplicative Composition Similarity: ", cosine_similarity(multiplicative_vector, multiplicative_landmark_vector))

combine_vector = combined_composition(verb_vector, noun_vector)
combine_landmark_vector = combined_composition(landmark_vector, noun_vector)
print("Combined Composition Similarity: ", cosine_similarity(combine_vector, combine_landmark_vector))


Testing verb: boom, noun: noise, landmark: thunder
Additive Composition Similarity:  0.702345
Multiplicative Composition Similarity:  0.46010327
Combined Composition Similarity:  0.69281757


In [None]:
# (iii) - Compare the cosine similarity scores between vectors of phrases with the average human scores

from scipy.stats import spearmanr

# compare the cosine similarity for each phrase pair with human input
human_scores_high = df[df['hilo'] == 'high']['input'].values
human_scores_low = df[df['hilo'] == 'low']['input'].values

# Collect the cosine similarities for the "high" and "low" tasks
hscore_high, hscore_low = [], []
base_cosin_sim_high, base_cosin_sim_low = [], []
add_cosin_sim_high, add_cosin_sim_low = [], []
weight_add_cosin_sim_high, weight_add_cosin_sim_low = [], []
multipli_cosin_sim_high, multipli_cosin_sim_low = [], []
combine_cosin_sim_high, combine_cosin_sim_low = [], []


if enabled_fasttext_model:
    test_unique_combinations = unique_combinations_all
else:
    test_unique_combinations = unique_combinations_inmodel

for comb, values in test_unique_combinations.items():
    verb, noun, landmark = comb
    verb_vector = get_word_vector(verb)
    noun_vector = get_word_vector(noun)
    landmark_vector = get_word_vector(landmark)
    
    ## additive composition
    add_com_vector = additive_composition(verb_vector, noun_vector)
    add_landmark_vector = additive_composition(landmark_vector, noun_vector)
    add_cosine_sim = cosine_similarity(add_com_vector, add_landmark_vector)

    ## weighted additive composition
    # For the best performing model the weight for the verb was 80% and for the noun 20%.
    weight_add_com_vector = additive_composition(verb_vector, noun_vector, weight1=0.8, weight2=0.2)
    weight_add_landmark_vector = additive_composition(landmark_vector, noun_vector, weight1=0.8, weight2=0.2)
    weight_add_cosine_sim = cosine_similarity(weight_add_com_vector, weight_add_landmark_vector)
    
    ## multiplicative composition
    multipli_com_vector = multiplicative_composition(verb_vector, noun_vector)
    multipli_landmark_vector = multiplicative_composition(landmark_vector, noun_vector)
    multipli_cosine_sim = cosine_similarity(multipli_com_vector, multipli_landmark_vector)
    
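    ## combined composition
    # NOTE: the two phrases are composed with different (alpha, beta) weights here;
    # the results reported below were produced with these settings. For a symmetric
    # comparison, both calls should use the same weights.
    combine_com_vector = combined_composition(verb_vector, noun_vector, alpha=0.8, beta=0.2) # alpha=0.5, beta=0.5, alpha=0.4, beta=0.6
    combine_landmark_vector = combined_composition(landmark_vector, noun_vector, alpha=0.4, beta=0.6)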
    combine_cosine_sim = cosine_similarity(combine_com_vector, combine_landmark_vector)

    base_cosin_sim = cosine_similarity(verb_vector, landmark_vector)
    
    if values['hilo'] == 'high':
        hscore_high.append(values['input_mean'])
        add_cosin_sim_high.append(add_cosine_sim)
        weight_add_cosin_sim_high.append(weight_add_cosine_sim)
        multipli_cosin_sim_high.append(multipli_cosine_sim)
        combine_cosin_sim_high.append(combine_cosine_sim)
        base_cosin_sim_high.append(base_cosin_sim)
    elif values['hilo'] == 'low':
        hscore_low.append(values['input_mean'])
        add_cosin_sim_low.append(add_cosine_sim)
        weight_add_cosin_sim_low.append(weight_add_cosine_sim)
        multipli_cosin_sim_low.append(multipli_cosine_sim)
        combine_cosin_sim_low.append(combine_cosine_sim)
        base_cosin_sim_low.append(base_cosin_sim)
    else:
        print(f"Unexpected hilo value for combination {comb}: {values['hilo']}")

# Spearman rank correlation between cosine similarity and human scores
add_spearman = spearmanr(hscore_high + hscore_low, add_cosin_sim_high + add_cosin_sim_low)
weight_add_spearman = spearmanr(hscore_high + hscore_low, weight_add_cosin_sim_high + weight_add_cosin_sim_low)
multipli_spearman = spearmanr(hscore_high + hscore_low, multipli_cosin_sim_high + multipli_cosin_sim_low)
combine_spearman = spearmanr(hscore_high + hscore_low, combine_cosin_sim_high + combine_cosin_sim_low)
base_spearman = spearmanr(hscore_high + hscore_low, base_cosin_sim_high + base_cosin_sim_low)


# Calculate the average cosine similarity for high and low tasks
avg_hscore_high = np.mean(hscore_high)
avg_hscore_low = np.mean(hscore_low)

avg_add_cosin_sim_high = np.mean(add_cosin_sim_high)
avg_add_cosin_sim_low = np.mean(add_cosin_sim_low)
avg_weight_add_cosin_sim_high = np.mean(weight_add_cosin_sim_high)
avg_weight_add_cosin_sim_low = np.mean(weight_add_cosin_sim_low)

avg_multipli_cosin_sim_high = np.mean(multipli_cosin_sim_high)
avg_multipli_cosin_sim_low = np.mean(multipli_cosin_sim_low)

avg_combine_cosin_sim_high = np.mean(combine_cosin_sim_high)
avg_combine_cosin_sim_low = np.mean(combine_cosin_sim_low)

avg_base_cosin_sim_high = np.mean(base_cosin_sim_high)
avg_base_cosin_sim_low = np.mean(base_cosin_sim_low)

# Summarise the average high/low scores and the Spearman correlations for each method
data_print = {
    "Method": ["NonComp", "Additive", "WeightedAdditive", "Multiplicative", "Combined", "Human"],
    "HighAvgScore": [avg_base_cosin_sim_high, avg_add_cosin_sim_high,  avg_weight_add_cosin_sim_high, avg_multipli_cosin_sim_high, avg_combine_cosin_sim_high, avg_hscore_high],
    "LowAvgScore": [avg_base_cosin_sim_low, avg_add_cosin_sim_low,  avg_weight_add_cosin_sim_low, avg_multipli_cosin_sim_low, avg_combine_cosin_sim_low, avg_hscore_low],
    "SpearmanCor": [base_spearman.correlation, add_spearman.correlation, weight_add_spearman.correlation, multipli_spearman.correlation, combine_spearman.correlation, upper_bound_rho],
    "SpearmanPvalue": [base_spearman.pvalue, add_spearman.pvalue, weight_add_spearman.pvalue, multipli_spearman.pvalue, combine_spearman.pvalue, upper_bound_p],
    
}

# Create a DataFrame
df_print = pd.DataFrame(data_print)
print(df_print)


             Method  HighAvgScore  LowAvgScore  SpearmanCor  SpearmanPvalue
0           NonComp      0.333687     0.313450     0.253610        0.005191
1          Additive      0.686278     0.677492     0.046742        0.612188
2  WeightedAdditive      0.427350     0.405725     0.222157        0.014741
3    Multiplicative      0.440778     0.393511     0.337915        0.000160
4          Combined      0.685325     0.676609     0.043999        0.633242
5             Human      5.084542     3.286200     0.660356        0.000096


**Any comments/thoughts should go here:**

We evaluated five models, a non-compositional baseline (**NonComp**) and four compositional functions (**Additive**, **Weighted Additive**, **Multiplicative**, and **Combined**), on the sentence similarity dataset `mitchell_lapata_acl08.txt`, which contains 120 unique (verb, noun, landmark) triplets. Each model was evaluated on its ability to:
1. Distinguish between high and low similarity sentence pairs, and  
2. Correlate with human similarity judgments, as measured by **Spearman’s rank correlation coefficient**.
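
Spearman's coefficient compares rankings rather than raw values, which is why cosine similarities can be compared directly with ratings on a 1 to 7 scale. A minimal sketch with made-up numbers (the `toy_` names and values are hypothetical and not taken from the dataset):

```python
from scipy.stats import spearmanr

# Hypothetical mean human ratings (1-7 scale) for five phrase pairs
toy_human = [1.5, 2.8, 4.0, 5.2, 6.5]
# Hypothetical model cosine similarities for the same five pairs:
# a different scale and not linearly related, but in the same order
toy_model = [0.10, 0.12, 0.40, 0.41, 0.95]

rho, p_value = spearmanr(toy_human, toy_model)
print(f"rho = {rho:.4f}, p = {p_value:.4f}")
```

Because only the ordering matters, the two lists are perfectly rank-correlated even though their values differ in scale and spacing.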


During data preprocessing, we first used the `wikipedia_10k` semantic space from the earlier parts of the lab. However, only 58 of the 95 unique words in the dataset are in its vocabulary, so it covered just **8 complete triplets**, which is too few for a robust analysis. Moreover, the resulting scores were not statistically significant.
To overcome this limitation, we switched to the **`fasttext` model**, which provides much broader vocabulary coverage. This allowed us to evaluate all **120 triplets** in the dataset and gave a far larger sample for the correlation analysis.
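
The coverage check described above can be sketched as follows. This is a simplified illustration with toy data, not the notebook's actual preprocessing code; `covered_triplets`, `toy_triplets` and `toy_vocabulary` are hypothetical names:

```python
def covered_triplets(triplets, vocabulary):
    """Return the triplets whose three words all appear in the vocabulary."""
    vocab = set(vocabulary)
    return [t for t in triplets if all(word in vocab for word in t)]


# Toy (verb, noun, landmark) triplets and a toy vocabulary for illustration only
toy_triplets = [
    ("boom", "noise", "thunder"),
    ("bow", "head", "stoop"),
    ("flare", "eye", "flame"),
]
toy_vocabulary = ["boom", "noise", "thunder", "bow", "head", "eye"]

covered = covered_triplets(toy_triplets, toy_vocabulary)
coverage = len(covered) / len(toy_triplets)
print(f"{len(covered)} of {len(toy_triplets)} triplets covered ({coverage:.0%})")
```

A triplet is dropped as soon as one of its three words is missing, so even moderate gaps in the vocabulary can remove most of the evaluation items.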


Among the evaluated methods, the **Multiplicative model performed best**. It achieved the **highest Spearman correlation with human judgments (ρ = 0.3379, p = 0.00016)** and showed the largest gap between the average scores of High similarity pairs (**0.4408**) and Low similarity pairs (**0.3935**). With *ρ = 0.3379 < 0.5*, the correlation is moderate at best, and it remains well below the inter-subject agreement of ρ = 0.6604. We can therefore infer that the model captures part of the semantic similarity reflected in human assessments, although not consistently across all items.

In contrast, the **Additive** and **Combined** models barely separated high from low similarity items (0.6863 vs. 0.6775 and 0.6853 vs. 0.6766 on average), and their correlations with human ratings were near zero and not significant (ρ = 0.0467, p = 0.61 and ρ = 0.0440, p = 0.63). The **Weighted Additive** model did better (ρ = 0.2222, p = 0.015) but still fell short of the Multiplicative model, did not improve on the NonComp baseline (ρ = 0.2536), and depends on the choice of weights.
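
Whether the gap between the high and low groups is reliable can be checked with a significance test, which the analysis above does not include. One possible sketch, assuming a one-sided Mann-Whitney U test is an acceptable choice and using made-up similarity scores (all `toy_` values are hypothetical):

```python
from scipy.stats import mannwhitneyu

# Hypothetical cosine similarities for items labelled 'high' and 'low'
toy_high = [0.52, 0.47, 0.61, 0.44, 0.58, 0.49]
toy_low = [0.31, 0.38, 0.29, 0.42, 0.35, 0.40]

# One-sided test: do the 'high' scores tend to be greater than the 'low' scores?
result = mannwhitneyu(toy_high, toy_low, alternative="greater")
print(f"U = {result.statistic}, p = {result.pvalue:.4f}")
```

In the notebook, the same call could be applied to pairs of lists such as `multipli_cosin_sim_high` and `multipli_cosin_sim_low`.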


The **Multiplicative model** likely performs better because it highlights **shared, meaningful features** of the two word vectors. Unlike additive models, which simply sum or average the information and may weaken semantic signals, the multiplicative model emphasizes the dimensions where **both vectors have strong values**. This leads to a more **focused representation**, better **filtering of irrelevant components**, and a stronger emphasis on **semantically aligned features**.
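
The filtering effect can be illustrated with small vectors. The sketch below uses made-up non-negative vectors (real fastText vectors also have negative components, so this is an idealised picture); `cosine_sim` and all values are hypothetical and only serve the illustration:

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 4-dimensional vectors; each dimension stands for one context feature
verb = np.array([0.9, 0.1, 0.8, 0.0])
landmark = np.array([0.1, 0.9, 0.7, 0.0])
noun = np.array([0.8, 0.0, 0.9, 0.5])

# Addition keeps every dimension that is active in either word
add_phrase = verb + noun
# Multiplication keeps only dimensions that are active in both words
mult_phrase = verb * noun
print("additive:      ", add_phrase)
print("multiplicative:", mult_phrase)

# Similarity of the two phrases under each composition function
base_sim = cosine_sim(verb, landmark)
add_sim = cosine_sim(verb + noun, landmark + noun)
mult_sim = cosine_sim(verb * noun, landmark * noun)
print(f"verb vs landmark: {base_sim:.3f}")
print(f"additive:         {add_sim:.3f}")
print(f"multiplicative:   {mult_sim:.3f}")
```

Multiplication zeroes out every dimension in which either word is inactive, whereas addition keeps them all. Because both phrases share the noun, the additive similarity in this toy case is also higher than the verb-landmark similarity on its own, which may help explain why the Additive scores for high and low items are so close.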

# Literature

[1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25 2014. Association for Computational Linguistics.  

[2] J. Mitchell and M. Lapata. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, 2008. Association for Computational Linguistics.
  
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[4] E. Vylomova, L. Rimell, T. Cohn, and T. Baldwin. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. arXiv, arXiv:1509.01692 [cs.CL], 2015.

## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

## Marks

The assignment is marked on a 7-level scale where 4 is sufficient to complete the assignment; 5 is good solid work; 6 is excellent work, covers most of the assignment; and 7 is creative work. 

This assignment has a total of 60 marks. These translate to grades as follows: 1 = 17%, 2 = 34%, 3 = 50%, 4 = 67%, 5 = 75%, 6 = 84%, 7 = 92%, where the percentages are interpreted as lower bounds to achieve that grade.