# A2: Vector Semantics

Nikolai Ilinykh, Mehdi Ghanimifard, Wafia Adouane and Simon Dobnik


The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.

---

In this lab we will look at how to build distributional semantic models from corpora and use semantic similarity captured by these models to do semantic tasks. We are also going to examine how different vector composition functions for phrases affect both the model and the learned information about similarities.  

Note that this lab uses code from `dist_erk.py`, which contains functions that closely resemble those shown during the lecture. In the end, you can use either set of functions (from the lecture or from the file) to solve the tasks.

In [1]:
# the following command simply imports all the methods from that code.
from dist_erk import *

## 1. Loading a corpus

**Important**: All necessary files which are used in this notebook are available on mlt-gpu, check `/srv/data/computational-semantics-assignment-02`.

To train a distributional model, we first need a sufficiently large collection of texts which contain different words used frequently enough in different contexts. Here we will use a section of the Wikipedia corpus (`wikipedia.txt`). This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/).

When unpacked, the file is 151 MB, hence if you are using the MLT servers you should store it in a temporary folder outside your home and adjust the `corpus_dir` path below.  
<!-- <It may already exist in `/opt/mlt/courses/cl2015/a5`.> -->


In [2]:
corpus_dir = '/'

## 2. Building a model

Now you are ready to build the model.  
Using the methods from the code imported above, build three word matrices with 1000 dimensions as follows:  

(i) with raw counts (saved to a variable `space_1k`);  
(ii) with PPMI (`ppmispace_1k`);  
(iii) with reduced dimensions SVD (`svdspace_1k`).  
For the latter use `svddim=5`. **[5 marks]**
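For intuition about the PPMI weighting in (ii), here is a minimal sketch on a toy count matrix. It is not the `ppmi_transform` from `dist_erk.py`: the helper name `ppmi_matrix`, the toy counts and the choice of the natural logarithm are all assumptions made for illustration.

```python
import math

def ppmi_matrix(counts):
    """Return a PPMI-weighted copy of a count matrix given as a list of rows."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        new_row = []
        for j, c in enumerate(row):
            if c == 0:
                # unseen pairs get 0 instead of log(0) = -infinity
                new_row.append(0.0)
                continue
            # PMI = log( P(w, c) / (P(w) * P(c)) ), written with raw counts
            pmi = math.log((c * total) / (row_sums[i] * col_sums[j]))
            # "positive" PMI: negative associations are clipped to 0
            new_row.append(max(pmi, 0.0))
        result.append(new_row)
    return result

# Hypothetical co-occurrence counts: 3 target words x 3 context words.
toy_counts = [
    [10, 0, 2],
    [0, 8, 2],
    [5, 5, 20],
]
toy_ppmi = ppmi_matrix(toy_counts)
for row in toy_ppmi:
    print([round(x, 3) for x in row])
```

Cells where a pair co-occurs less often than chance would predict end up as 0, so only "surprisingly frequent" co-occurrences keep a positive weight.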

Your task is to replace `...` with function calls. Functions are imported from `dist_erk.py` earlier, and they largely resemble functions shown during the lecture.

In [3]:
numdims = 1000
svddim = 5

# which words to use as targets and context words?
# we need to count the words and keep only the N most frequent ones
# which function would you use here with which variable?
ktw = do_word_count('/srv/data/computational-semantics-assignment-02/', numdims)

wi = make_word_index(ktw)
# words_in_order = ... # sorted words

# create different spaces (the original matrix space, the ppmi space, the svd space)
# which functions with which arguments would you use here?
print('create count matrices')
space_1k = make_space('/srv/data/computational-semantics-assignment-02/', wi, numdims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, wi)
print('svd transform')
svdspace_1k = svd_transform(space_1k, numdims, svddim)
print('done.')

reading file wikipedia.txt
create count matrices
reading file wikipedia.txt
ppmi transform
svd transform
done.


In [4]:
# this part is separate because I wanted to split the parts up not to have to rerun them all the time. I will delete it really
# soon once I am sure all this is correct :) 
print('create count matrices')
space_1k = make_space('/srv/data/computational-semantics-assignment-02/', wi, numdims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, wi)
print('svd transform')
svdspace_1k = svd_transform(space_1k, numdims, svddim)
print('done.')

create count matrices
reading file wikipedia.txt
ppmi transform
svd transform
done.


In [5]:
# now, to test the space, you can print vector representation for some words
print('house:', space_1k['house'])

house: [2554 3774 3105  567  962  631  443  185  311  189  131   28   93  169
   81  125  151  408  194   90   79   29  217  184   62   15   31   70
   10    1   41   21    1   31   37    1   30    5   25    7    3   20
   11    1   32   36    2    5   66    4    0   46    8   18   28    0
   20    7    8   16   10   40    0  175   10    2    7   19    1  174
   11    3    1    6    0    0    0   10    9   11    7   24    4    4
   14   23   58    7    0   10    2    3   10    6   18    6   13    3
   22    0    3    5    3    7   14    3   40   20   19   15    6    8
   24    4    5    1   19    0    3    1    0   14    0   14   53    7
    7   11    6    5    5    4   12    6   53    1    1  433    4    0
    5    7    7   12    1    1    3    4   17    8   16    1    2   31
    1   12   14    1   44    6   14    9   38    7    2    6    8    1
   10    6   10    1    9    7    9    4    3   10    0   11    3    2
    0    2   11   37    2    0    2    1    5    9   10   16   88    6

Oxford Advanced Dictionary has 185,000 words, hence 1,000 words is not representative. We trained a model with 10,000 words, and 50 dimensions on truncated SVD. It took 40 minutes on a laptop. All matrices are available on mlt-gpu: `ktw_wikipediaktw.npy`, `raw_wikipediaktw.npy`, `ppmi_wikipediaktw.npy`, `svd50_wikipedia10k.npy`. Make sure they are in your path, because they will be loaded below.

In [6]:
import numpy as np

numdims = 10000
svddim = 50

print('Please wait...')
ktw_10k       = np.load('/srv/data/computational-semantics-assignment-02/ktw_wikipediaktw.npy', allow_pickle=True)
space_10k     = np.load('/srv/data/computational-semantics-assignment-02/raw_wikipediaktw.npy', allow_pickle=True).all()
ppmispace_10k = np.load('/srv/data/computational-semantics-assignment-02/ppmi_wikipediaktw.npy', allow_pickle=True).all()
svdspace_10k  = np.load('/srv/data/computational-semantics-assignment-02/svd50_wikipedia10k.npy', allow_pickle=True).all()
print('Done.')


Please wait...
Done.


In [7]:
# testing semantic space
print('house:', space_10k['house'])

house: [2554 3774 3105 ...    0    0    0]


## 3. Testing semantic similarity

The file `similarity_judgements.txt` (a copy is included with this notebook) contains 7,576 pairs of words and their lexical and visual similarities (based on the pictures) collected through crowd-sourcing using Mechanical Turk as described in [1]. The scores range from 1 (highly dissimilar) to 5 (highly similar). Note: this is a different dataset from the phrase similarity dataset we discussed during the lecture (the one from [2]). For more information, please read the papers.

The following code will transform similarity scores into a Python-friendly format:

In [8]:
print(ktw_10k)

['' 'the' 'of' ... 'assumptions' 'superhero' 'dots']


In [9]:
word_pairs = [] # test suite word pairs
semantic_similarity = [] 
visual_similarity = []
test_vocab = set()

for index, line in enumerate(open('similarity_judgements.txt')):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # it will check if both words from each pair exist in the word matrix.
        if w1 in ktw_10k and w2 in ktw_10k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))
        
print('number of available words to test:', len(test_vocab-(test_vocab-set(ktw))))
print('number of available word pairs to test:', len(word_pairs))
list(zip(word_pairs, visual_similarity, semantic_similarity))

number of available words to test: 12
number of available word pairs to test: 774


[(('stick', 'sword'), 3.4, 2.6),
 (('cabin', 'cabinet'), 1.75, 2.25),
 (('chicken', 'sparrow'), 3.0, 4.5),
 (('bag', 'gate'), 1.0, 1.5),
 (('bull', 'trumpet'), 1.0, 1.25),
 (('balloon', 'brick'), 1.0, 1.0),
 (('helicopter', 'jet'), 2.6, 4.6),
 (('cape', 'spider'), 1.0, 1.2),
 (('boots', 'mouse'), 1.0, 1.0),
 (('kite', 'marble'), 1.0, 1.0),
 (('box', 'horse'), 1.0, 1.0),
 (('doll', 'elephant'), 1.0, 1.0),
 (('brick', 'cannon'), 1.0, 1.0),
 (('ruler', 'turkey'), 1.0, 1.0),
 (('pan', 'pot'), 4.0, 4.75),
 (('bottle', 'orange'), 1.0, 1.25),
 (('plate', 'rock'), 1.75, 1.5),
 (('subway', 'wall'), 1.25, 1.25),
 (('cup', 'level'), 1.0, 1.2),
 (('book', 'submarine'), 1.0, 1.0),
 (('bus', 'gun'), 1.0, 1.0),
 (('inn', 'telephone'), 1.0, 1.4),
 (('bike', 'hook'), 1.0, 1.0),
 (('building', 'car'), 1.2, 1.6),
 (('airplane', 'truck'), 1.2, 3.4),
 (('cat', 'rabbit'), 2.75, 4.5),
 (('cedar', 'pine'), 4.5, 4.75),
 (('corn', 'oak'), 1.75, 2.25),
 (('chapel', 'magazine'), 1.0, 1.0),
 (('book', 'bureau'), 1

Now we are going to test how the cosine similarity between vectors of each of the three spaces (normal space, ppmi, svd) compares with the human similarity judgements for the words in the similarity dataset. Which of the three spaces best approximates human judgements?

For comparison of several scores, we can use the [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient), which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html). The values of the Spearman correlation coefficient range from -1 to 1, where 0 indicates no correlation, 1 a perfect positive correlation and -1 a perfect negative correlation. Hence, the greater the number, the better the similarity scores align. The p-value tells us whether the coefficient is statistically significant. For this to be the case, it must be less than or equal to $0.05$.
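To make the "rank" part concrete, the sketch below computes the coefficient by hand as Pearson's correlation over ranks. It is a pure-Python illustration, not a replacement for `scipy.stats.spearmanr`: the helper names and the toy scores are made up, and no p-value is computed.

```python
def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

human = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [0.01, 0.02, 0.10, 0.50, 0.90]  # monotone in the human scores, but not linear

rho_same_order = spearman_rho(human, model)
rho_reversed = spearman_rho(human, model[::-1])
print(rho_same_order, rho_reversed)
```

Because only the ordering matters, cosine values on a 0 to 1 scale can be compared directly with human ratings on a 1 to 5 scale.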

Here is how you can calculate Spearman's correlation coefficient between the scores of visual similarity and semantic similarity of the available words in the test suite:

In [10]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Visual Similarity vs. Semantic Similarity:
rho     = 0.7122
p-value = 0.0000


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all three matrices. **[6 marks]**

In [11]:
raw_similarities  = [cosine(w1,w2,space_10k) for w1, w2 in word_pairs]
ppmi_similarities = [cosine(w1,w2,ppmispace_10k) for w1, w2 in word_pairs]
svd_similarities  = [cosine(w1,w2,svdspace_10k) for w1, w2 in word_pairs]
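The `cosine` helper used above comes from `dist_erk.py`, and its implementation is not shown in this notebook. As a rough sketch of what such a function computes, assuming the space is a dictionary from words to vectors, the standard definition looks like this (`cosine_sketch` and `toy_space` are hypothetical names with made-up values):

```python
import math

def cosine_sketch(word1, word2, space):
    """Cosine of the angle between the vectors stored for two words."""
    v1, v2 = space[word1], space[word2]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)

toy_space = {
    'pan': [4, 0, 3],
    'pot': [8, 0, 6],   # same direction as 'pan', twice the counts
    'gate': [0, 5, 0],  # shares no context with 'pan'
}

print(cosine_sketch('pan', 'pot', toy_space))
print(cosine_sketch('pan', 'gate', toy_space))
```

Since the dot product is divided by both vector lengths, a frequent word and a rare word can still be maximally similar if their contexts are distributed in the same proportions.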

Now, calculate correlation coefficients between lists of similarity scores and the real semantic similarity scores from the experiment. The scores of which model correlate best with them? Is this expected? **[6 marks]**

In [12]:
for k,v in {"raw": raw_similarities, "ppmi": ppmi_similarities, "svd": svd_similarities}.items():
    print(f'Original semantic similarity vs. the calculated {k} similarity:')
    rho, pval = stats.spearmanr(semantic_similarity, v)
    print("""
    rho     = {:.4f}
    p-value = {:.4f}""".format(rho, pval))

Original semantic similarity vs. the calculated raw similarity:

    rho     = 0.1522
    p-value = 0.0000
Original semantic similarity vs. the calculated ppmi similarity:

    rho     = 0.4547
    p-value = 0.0000
Original semantic similarity vs. the calculated svd similarity:

    rho     = 0.4232
    p-value = 0.0000


**Your answer should go here:**
The scores of the ppmi model correlate best with the original semantic similarity scores, while the raw scores perform by far the worst, with the svd similarities not being far behind ppmi.  
In my opinion it is not very unexpected, and I will discuss these one by one:
+ raw similarity relies on the raw vectors, where no extra calculation was done; this is a kind of a baseline, and the other versions could perform better or worse than this one - if they are worse that means we are not getting a good meaning representation, and if they perform better, that means that the transformation that we implemented is good at extracting something in the vectors that points to meaning.
+ ppmi similarity relies on vectors that store information about how relevant the co-occurrence of the two words is; this means that it stores whether two words form some collocation or common phrase, or whether they occurred together by chance. This may be good for determining similarity, as it will give us similar values for words co-occurring in thematically similar surroundings (e.g. "cold" and "chilly" will often occur in contexts describing winter weather, and less commonly in other contexts). This is not exactly what PPMI tests/saves, but I think this is how it helps improve similarity scores here.
+ the SVD similarity relies on vectors the size of which was reduced via singular value decomposition. This sort of dimensionality reduction essentially forces the computer to come up with a smaller number of categories that will preserve the differences between the different vectors; while we don't understand what those categories really are, the computer is apparently good at noticing stuff that is indicative of a word's "meaning" as represented by the vector.  

Thus, it makes sense that the models which process the raw data in some way that is supposed to highlight/reveal relevant values in the vectors work better than the model with just the raw data.  

Perhaps an SVD model based on a PPMI model (and not the raw one, like in our case) would be even better, but it would take too long a time to make one like that (since you said it took 40min to calculate the 10k versions for you) for this version of the assignment - perhaps it could be a part of a VG attempt based on this assignment though?

We can also calculate correlation coefficients between lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model best correlates with them? How do the correlation coefficients compare with those from the previous comparison - and can you speculate why do we get such results? **[7 marks]**

In [13]:
for k,v in {"raw": raw_similarities, "ppmi": ppmi_similarities, "svd": svd_similarities}.items():
    print(f'Original visual similarity vs. the calculated {k} similarity:')
    rho, pval = stats.spearmanr(visual_similarity, v)
    print("""
    rho     = {:.4f}
    p-value = {:.4f}""".format(rho, pval))

Original visual similarity vs. the calculated raw similarity:

    rho     = 0.1212
    p-value = 0.0007
Original visual similarity vs. the calculated ppmi similarity:

    rho     = 0.3838
    p-value = 0.0000
Original visual similarity vs. the calculated svd similarity:

    rho     = 0.3097
    p-value = 0.0000


**Your answer should go here:** As with semantic similarity, the faithfulness of our calculated similarities follows the pattern PPMI > SVD > raw. Overall, though, the scores are lower than when we compared our calculated similarities to the semantic ones. If we look at the stored values for `visual_similarity` and `semantic_similarity`, we can see that they do not always agree (e.g. a pair of words can have a high visual similarity and a low semantic similarity, or the other way around). This could be because, when people were asked to evaluate the similarity of two words or two images, they took different features of these concepts into account. When asked to evaluate semantic similarity, they would think not only of the visuals of the two, but also of whether they share a hypernym, whether they share any features or relations, and whether they are conceptually similar (e.g. ``(('chicken', 'sparrow'), 3.0, 4.5)`` shows a high semantic similarity because both are birds, both have feathers and beaks and wings, both eat grain and peck and make bird noises, and both could be preyed upon by a bird of prey). When comparing visual similarity, the respondents likely focused only on the visual features of the two images, so any similarity that is not visible escaped their judgement; I would expect these similarity scores to tend to be lower than the semantic ones (for the same example as above, the chicken and the sparrow have a lower visual similarity because, although they are both birds and so have beaks, wings and feathers, they have a different silhouette, color and proportions).

## 4. Operations on similarities

We can perform mathematical operations on vectors to derive meaning predictions. For example, we can subtract the normalised vectors for `king` minus `queen` and add the resulting vector to `man` and we hope to get the vector for `woman`. Why? **[3 marks]**

**Your answer should go here:** operations of this kind on vectors are element-wise. Since we are looking at SVD vectors, we can assume that each element in a vector reveals something about its meaning (e.g. one element indicates age, one indicates gender, one indicates power, etc., with different values of these elements representing something on that spectrum). If we looked at raw vectors, each element would just indicate how often our word occurred together with a different one, but processing the vector with something like SVD means that we force our computer to generalize over that and create its own features that words share or differ in.  
If we calculate ``king - queen`` we will 'neutralize' all of the elements they have in common: if both king and queen are equally high on the power scale, that will be gone in the resulting vector. However, differences pertaining to what sets these two vectors apart will be preserved: say that being female is encoded as 0.3 in the vector, and being male as the same element having the value 1. As a result, we will get a vector with 0.7 in that place, which is our difference between the two.
What I do not personally get is why adding this to the vector for man would make it a vector meaning woman; I feel like it should be the other way round and the vector we get would preserve the "masculinity" and not the "femininity".
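The direction of the offset can be checked on hand-made toy vectors. Everything below is an assumption made for illustration (two made-up dimensions, `[royalty, gender]`, with male as +1 and female as -1); real SVD dimensions are not interpretable this cleanly.

```python
def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    """Element-wise a + b."""
    return [x + y for x, y in zip(a, b)]

# Hand-made 2-d vectors: [royalty, gender], with male = +1 and female = -1.
toy = {
    'king':  [1.0,  1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0,  1.0],
    'woman': [0.0, -1.0],
}

gender_offset = sub(toy['king'], toy['queen'])     # [0.0, 2.0]: royalty cancels out
man_minus_offset = sub(toy['man'], gender_offset)  # [0.0, -1.0]
man_plus_offset = add(toy['man'], gender_offset)   # [0.0, 3.0]

print(man_minus_offset == toy['woman'])
print(man_plus_offset)
```

In this toy setting it is subtracting the `king - queen` offset from `man` that lands on `woman`, which matches the form `light - (heavy - long)` used in the code cell further down; adding the offset moves further in the "male" direction instead.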

Here is some helpful code that allows us to calculate such comparisons.

In [14]:
from scipy.spatial import distance

def normalize(vec):
    return vec / veclen(vec)

def find_similar_to(vec1, space):
    # vector similarity function
    #sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.correlation(a, b)
    #sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    #sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)

Here is how you apply this code. Comment on the results you get. **[3 marks]**

In [15]:
short = normalize(svdspace_10k['short'])
light = normalize(svdspace_10k['light'])
long = normalize(svdspace_10k['long'])
heavy = normalize(svdspace_10k['heavy'])

find_similar_to(light - (heavy - long), svdspace_10k)[:10]

[('long', 0.8733111261346901),
 ('above', 0.8259671977311955),
 ('around', 0.8030776291120685),
 ('sun', 0.7692439111243973),
 ('just', 0.7678481974778111),
 ('wide', 0.767257431992253),
 ('each', 0.7665960260861158),
 ('circle', 0.7647746702909336),
 ('length', 0.7601066921319761),
 ('almost', 0.7542351860536628)]

**Your answer should go here:** as far as I understand, we should be getting "short" pretty high up here - which we are not. I am not sure if the formula is correct (and I already messaged Nikolai about it), but I would also guess that our models simply do not have that many words or that "detailed" vectors. Another option is that the way that the vectors are constructed, or the information that they store is not informative about the semantic differences between these words (vide the example below).

Find 5 similar pairs of pairs of words and test them. Hint: Google for `word analogies examples`. You can also construct analogies that are less lexical but more grammatical, e.g. `see, saw, leave, ?` or analogies that are based on world knowledge as in the [Google analogy dataset](http://download.tensorflow.org/data/questions-words.txt) from [3]. Does the resulting vector similarity confirm your expectations? But remember you can only do this if the words are contained in our vector space with 10,000 dimensions. **[10 marks]**

In [16]:
old = normalize(svdspace_10k['old'])
older = normalize(svdspace_10k['older'])
low = normalize(svdspace_10k['low'])
lower = normalize(svdspace_10k['lower'])
cat = normalize(svdspace_10k['cat'])
lion = normalize(svdspace_10k['lion'])
dog = normalize(svdspace_10k['dog'])
wolf = normalize(svdspace_10k['wolf'])
play = normalize(svdspace_10k['play'])
plays = normalize(svdspace_10k['plays'])
find = normalize(svdspace_10k['find'])
finds = normalize(svdspace_10k['finds'])
known = normalize(svdspace_10k['known'])
unknown = normalize(svdspace_10k['unknown'])
likely = normalize(svdspace_10k['likely'])
unlikely = normalize(svdspace_10k['unlikely'])
son = normalize(svdspace_10k['son'])
daughter = normalize(svdspace_10k['daughter'])
father = normalize(svdspace_10k['father'])
mother = normalize(svdspace_10k['mother'])

In [17]:
pairs = [
    (('old','older'),('low','lower')), (('cat','lion'),('dog','wolf')), (('play','plays'),('find','finds')), 
    (('known','unknown'),('likely','unlikely')), (('son','daughter'),('father','mother'))
]

for pair in pairs:
    pair_1, pair_2 = pair
    w1, w2 = pair_1
    w3, w4 = pair_2
    if w1 in svdspace_10k and w2 in svdspace_10k and w3 in svdspace_10k and w4 in svdspace_10k:
        word1 = normalize(svdspace_10k[w1])
        word2 = normalize(svdspace_10k[w2]) 
        word3 = normalize(svdspace_10k[w3])
        word4 = normalize(svdspace_10k[w4])
    else:
        continue
    print(f'Looking for words similar to {w3} the way {w2} is similar to {w1}. The truest answer is {w4}.')
    similar_words = find_similar_to(word3 - (word2 - word1), svdspace_10k)[:10]
    print(similar_words)
    print()

Looking for words similar to low the way older is similar to old. The truest answer is lower.
[('high', 0.7369297758077877), ('low', 0.705877657514738), ('lower', 0.6999614992022002), ('the', 0.6984720213213879), ('greater', 0.6939754313526645), ('mass', 0.6932188499652133), ('through', 0.6881466856344597), ('an', 0.6814776301637034), ('full', 0.6814555592712804), ('point', 0.6809995834716109)]

Looking for words similar to dog the way lion is similar to cat. The truest answer is wolf.
[('cat', 0.880598975582066), ('dog', 0.8778035859347283), ('baby', 0.7999964122382902), ('boy', 0.7957464088994218), ('pig', 0.7937301924547002), ('wild', 0.7915060372048864), ('girl', 0.78522974135338), ('big', 0.7595891760773924), ('dogs', 0.759046914688173), ('cow', 0.757396808785294)]

Looking for words similar to find the way plays is similar to play. The truest answer is finds.
[('find', 0.8829497569526007), ('get', 0.879411870981567), ('keep', 0.8744868662449081), ('make', 0.8727385408069952), ('t

**Your answer should go here:** For some examples we do get "good" results, for some we do not:
+ In the first one the word that we are looking for is in the top 3 suggestions, which I consider to be a success.
+ In the second one the target word is not found, but we do find other words for animals, and, interestingly, "wild" there, which I guess is some connection in terms of meaning.
+ In the third one we do not find the correct answer. It seems that our vectors are not good at finding morphological differences (though they did in example 1).
+ Similar case as in example 3, but some meaning of likelihood is preserved with words like might, would, etc.
+ Interestingly a lot of family words can be found here (so that the meaning of family is preserved), but the female meaning can only be found with "wife" quite far down the list.

Overall there are some elements of meaning that are preserved, but only one example actually finds the correct word. This may be due to many of these pairs of pairs differing in morphology and not meaning per se, which is not that well stored in our vectors, apparently. It may also be due to our model not being trained on a sufficiently big corpus.

## 5. Semantic composition and phrase similarity **[20 marks]**

In this task, we are going to look at how different semantic composition models, introduced in [2], correlate with human judgements. The file with the dataset `mitchell_lapata_acl08.txt` is included with this notebook (we also used it in the class).

---

Explanation of the task from Discord channel:

**What are we trying to achieve?**  
We want to create models that can automatically capture differences between the meanings of different phrases. These models should also be as good at this task as we (humans) are. Check example (1) from the Mitchell/Lapata paper: it has two sentences which share the same words, but their meaning is completely different. We as humans can clearly see this difference, but how could a machine capture it?

It is intuitive that in order to get the meaning of a phrase, we might combine the meanings of the individual words in this phrase somehow, but how? First, we represent each word with the frequency vector based on the semantic space that we built before (e.g., frequency space, ppmi space, svd space). In other words, each word's meaning is represented by the number of times other words occur in the context, defined by window size. Now, we have a vector for `discussion` and a vector for `thrive`.

_How do we combine these vectors to get a single vector for the phrase `discussion thrive`?_

Such methods of combining meaning vectors into a single item are called _semantic composition methods_ (literally, because we compose semantic meaning of the phrase from its individuals). During the lecture, we tried different semantic composition methods: additive, multiplicative, combined.
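As a rough sketch of these three composition functions on plain Python lists: the function names, the toy vectors and the weights `alpha`, `beta` and `gamma` are placeholders chosen for illustration, not values taken from [2] or from the lecture code.

```python
def additive(u, v):
    """Element-wise sum of the two word vectors."""
    return [a + b for a, b in zip(u, v)]

def multiplicative(u, v):
    """Element-wise product of the two word vectors."""
    return [a * b for a, b in zip(u, v)]

def combined(u, v, alpha=0.5, beta=0.5, gamma=1.0):
    """Weighted sum plus weighted element-wise product.

    The default weights are placeholders and would need tuning.
    """
    return [alpha * a + beta * b + gamma * a * b for a, b in zip(u, v)]

# Hypothetical count vectors over three context words.
discussion = [2.0, 0.0, 4.0]
thrive = [1.0, 3.0, 0.0]

print(additive(discussion, thrive))
print(multiplicative(discussion, thrive))
print(combined(discussion, thrive))
```

Note that the multiplicative model zeroes out every dimension in which either word has a zero count, so it keeps only the contexts the two words share, whereas addition keeps all contexts of both words.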

Let's say we multiplied the vectors (went with the multiplicative method) and now we have one vector for our phrase.

Remember, we want to have a model that captures differences between the phrases; it means that if `discussion thrive` is our reference phrase, we need to have a different phrase (high or low similarity phrase) to compare it against the reference one. How do we get this other phrase? Well, this other phrase can be either very similar to the reference or not similar at all, right? Let's say we decided to go with the second option and made/constructed a phrase `discussion digress`, which we know is very dissimilar to the reference phrase. We label this pair of phrases as having a low similarity, e.g., `low` in `hilo`. We can also create a different phrase (e.g., `discussion develop`) and use it as a high similarity phrase when paired with our reference phrase, right? This then would be labeled as `high` in `hilo`. This is what `hilo` in the dataset stands for: known information about how similar the reference phrase and the landmark phrase are.

Now, our main task is to automatically learn the similarities/differences between our reference and our landmark, right? We take the first pair: `discussion thrive` vs. `discussion digress`. We have a vector representation for each of these phrases.

**How do we compare two vectors?**

We use cosine similarity to calculate a single score that tells us how similar these two vectors are. The bigger the cosine, the more similar the two vectors are. For vectors with non-negative entries, such as raw counts or PPMI values, cosine ranges from 0 to 1 (0 is very low similarity, 1 is very high similarity); for vectors that can contain negative entries, such as SVD vectors, it can go down to -1. Let's say we get a cosine of 0.89: it means that according to the multiplicative model (remember, we decided to use the multiplicative semantic composition method), these phrases are very similar (cosine is quite high). But wait a second, we know that these two phrases should be of low similarity, right? Because this is what the value in `hilo` tells us about this pair - `discussion thrive` and `discussion digress` are not similar to each other. Clearly, our multiplicative method fails to capture it.

**What can we do to improve our model?**

We can try a different composition method to get a phrase vector from the phrase's words: let's replace multiplication with addition. Or we can also use the combined method. Let's say we used the combined method and ran cosine again; this time it tells us that the cosine score is 0.45. Ok, this seems to be quite low, and it also agrees with our knowledge that these phrases are indeed not similar.

---

In other words, we need to evaluate different composition models (additive, multiplicative, combined) and analyse how well they perform. **How do we analyse their performance?** Because we have the ground truth for comparison (`hilo` values), we know whether our phrases are actually similar or not. We want our cosine score to reflect this knowledge: if the score is high, but the ground-truth hilo is low, then we have a problem in the model - it did not learn things well, and we need to replace the composition function.

---

_Long story short_: `hilo` is something that we compare our cosine to. `hilo` contains correct answers about similarity between phrases, and cosine should agree with this. If a reference-landmark pair is `high` in `hilo`, then cosine should be high enough to reflect that. If cosine is not high in this case, then we look at our model and change the composition function. We need to find the function that gives us cosines that are as close as possible to the known ground-truth `hilo` values.

<img src="res.png" alt="drawing" width="500"/>

Next, we want to compute the correlation between our model's predictions and the ground truth, to get something similar to the results from [2] (image above):

---

In `High` and `Low` columns we have mean cosine values.
These are calculated by averaging cosine scores for all pairs of phrases per model.
Rows introduce different models: `add` is additive, `multiply` is multiplicative, etc. (these are all described in the paper). `NonComp` is a baseline model, the most "stupid" one; it should be the worst. `UpperBound` is how humans performed in this task (they were asked to rate similarity between pairs of phrases, where 1 is the lowest and 7 is the highest). Why are these numbers on a different scale, from 1 to 7 rather than from 0 to 1 like cosine? Because they are not normalised, and the authors explicitly said that they are interested in relative differences.

We need models which are closer to human ratings. `Add` model has 0.59 mean for `High` and 0.59 for `Low`, so it did not learn to differentiate between high similarity pairs and low similarity ones. This is a bad model then, we need a better one. `WeightAdd` seems to be doing better, the difference between `High` and `Low` is now 0.01, but it's also quite bad - the difference is not that obvious. The best models are `Multiply` and `Combined`, because their mean cosines for `High` and `Low` are quite different from each other. We can see that these two models gave higher cosines for high similarity pairs (0.42 and 0.38), while giving lower cosines for low similarity pairs (0.28 and 0.28). And this is a good result - it shows that these two composition functions are so far the best in (i) giving high cosine to highly similar pairs and low cosine to very dissimilar pairs, and (ii) keeping the distance (range) between high and low cosines quite large.
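The `High` and `Low` means for one model can be sketched as a simple group-and-average step. The record layout (a `hilo` label paired with a cosine), the function name and all numbers below are made up for illustration.

```python
from collections import defaultdict

def mean_cosine_by_hilo(records):
    """records: iterable of (hilo_label, cosine) tuples; returns {label: mean cosine}."""
    grouped = defaultdict(list)
    for hilo, cos in records:
        grouped[hilo].append(cos)
    return {label: sum(values) / len(values) for label, values in grouped.items()}

# Made-up cosines produced by one composition model.
toy_records = [
    ('high', 0.50), ('high', 0.30),
    ('low', 0.25), ('low', 0.35),
]

means = mean_cosine_by_hilo(toy_records)
gap = means['high'] - means['low']
print(means, gap)
```

A model whose gap is close to zero does not separate the two groups, no matter how large the individual means are.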

**However, we can't say how far/close they are when compared to human performance (UpperBound) since human scores are not normalised.**

Still, we want to choose a model which is the closest to humans. This is why we want to run **the correlation test.**

How do we perform the correlation test? We need to see how well *each model* correlates with human judgements. So for each model, we would have a vector of cosine values this model gives for each pair that we have. For example, let's say we have three pairs of phrases and our cosine values from additive model are the following ones: `[0.89, 0.40, 0.70]`. Now we need to get a vector of the same size, but for human scores. What do we have for human scores? We have multiple participants, which means that a single phrase can be evaluated by multiple participants. Let's say, the first phrase has scores from two participants (`6, 7`), so what we would do is that we would average it to have a single number (`(6 + 7) / 2 = 6.5`). With this, we can get a mean vector of human scores per item: `[6.5, 3, 6]`.
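The averaging step can be sketched as follows. Only the first item's two scores (6 and 7) come from the example above; the individual scores for the other two items are made up so that their means match the example vector.

```python
from statistics import mean

# One list of participant ratings per phrase pair.
ratings_per_item = [
    [6, 7],  # from the example above
    [3, 3],  # made up, mean 3
    [6, 6],  # made up, mean 6
]
human_means = [mean(scores) for scores in ratings_per_item]
model_cosines = [0.89, 0.40, 0.70]

def ordering(values):
    """Indices of the items sorted from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])

print(human_means)
print(ordering(human_means) == ordering(model_cosines))
```

In this toy case both lists order the three items identically, which is exactly the situation in which a rank correlation reaches its maximum.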

Now we have two vectors and can run a Spearman correlation on them. This is exactly what the third column in the Results table shows (the correlation value). There is also a p-value (given right below the table).
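To show what the Spearman coefficient measures, here is a minimal pure-Python sketch that ranks both vectors and computes the Pearson correlation of the ranks. The helper names (`average_ranks`, `pearson`, `spearman_rho`) are illustrative and not part of `dist_erk.py`; in the notebook itself, `scipy.stats.spearmanr` does the same job and also returns the p-value, which this sketch does not compute.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


model_cosines = [0.89, 0.40, 0.70]  # example values from the text
human_means = [6.5, 3, 6]           # example values from the text

rho = spearman_rho(model_cosines, human_means)
print(round(rho, 2))
```

In this toy example the model and the humans order the three pairs identically, so the coefficient is (up to rounding) 1. Spearman only looks at the ordering, which is why the different scales of cosines and human ratings do not matter.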

**What is your ultimate task in this part of the assignment?**

(i) Process the dataset and extract `reference - landmark` pairs; you can use the code from the lecture as a starting point. Keep the information about the human rating (`input`) and high/low similarity (`hilo`), because you will need it for the correlation tests. You might also want to keep the participant id, which is useful for computing the per-pair averages used in the correlation tests. Which format should you use to keep all this data? It is up to you, but a dictionary-like format could be a good choice.

(ii) Build models of semantic phrase composition: in the lecture we introduced simple additive, simple multiplicative and combined models (details are in [2]). Your task is to take a single pair of phrases, and compute the composition of its vectors using each of these functions. Thus, you will have (at least) three compositional models that take each `noun - verb` phrase from the pair (these phrases can be either references or landmarks) and output a single vector, representing the meaning of this phrase. As your semantic space, you can use pretrained spaces (standard space, ppmi or svd) introduced above. It is up to you which space you use, but for someone who runs your code, it should be pretty straightforward to switch between them.

(iii) Calculate the Spearman correlation between each model's predictions and the human judgements; you should get something similar to the scores shown in the paper.
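The composition functions in step (ii) can all be seen as special cases of one weighted formula, `alpha * noun + beta * verb + gamma * (noun * verb)`, applied element-wise. The sketch below uses plain Python lists and made-up 3-dimensional vectors; the function name `compose` and its parameter names are illustrative. The weights for the weighted additive and combined models are the ones used in the solution code further down in this notebook.

```python
def compose(noun_vec, verb_vec, alpha=1.0, beta=1.0, gamma=0.0):
    """Element-wise alpha*noun + beta*verb + gamma*(noun*verb)."""
    return [alpha * n + beta * v + gamma * n * v
            for n, v in zip(noun_vec, verb_vec)]


# Hypothetical toy vectors; real ones come from the chosen semantic space.
noun = [1.0, 2.0, 0.0]
verb = [3.0, 4.0, 5.0]

additive = compose(noun, verb)                                        # alpha = beta = 1
multiplicative = compose(noun, verb, alpha=0.0, beta=0.0, gamma=1.0)  # only the product term
weighted = compose(noun, verb, alpha=0.2, beta=0.8)                   # weighted addition
combined = compose(noun, verb, alpha=0.0, beta=0.95, gamma=0.05)      # combined model

print(additive)
print(multiplicative)
```

Writing the models as one parameterised function makes it easy to try other weights without duplicating code.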

**Thought process behind calculating the correlation:**

Let's look at the example pair: reference `child chatter` and high-level similarity landmark (as the last word in the row indicates) `child gabble`. Let's say we have 3 humans evaluating the similarity between these two phrases and we combine their scores into a single vector: `[5, 6, 5]`. We need to average them to get our human vector for correlation: `[5.3]`.

Model A's output:  
`cosine(p1, p2) = 0.88`, where p1 is the result of adding the word vectors in the reference phrase `child chatter`, and p2 is the result of adding the word vectors in the high similarity landmark phrase `child gabble`.  

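Therefore, for this pair we have the mean human rating `5.3` and the model A score `0.88`. A correlation cannot be computed from a single pair, so collect these two values for every pair in the dataset and then compute the correlation between the resulting vectors. This gives you a correlation value and a p-value for the chosen model against the human ratings.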

Of course, your human rating vectors will be longer (e.g., `[6, 7, 3, 4, 5]`). Each of your models (A, B, C) will produce a single vector of cosine similarities for these same pairs (e.g., `[0.89, 0.98, 0.23, 0.65, 0.55]`). The goal is to compare each model's cosine similarity vector with the human rating vector and identify the model that comes closest to the way humans rate the similarity between the phrases.
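The model side of this comparison, for a single pair, might look like the sketch below. The 3-dimensional vectors in `toy_space` are invented for illustration (a real run would look the words up in one of the spaces built earlier), and the helper names `add_vectors` and `cosine` are likewise illustrative.

```python
import math

# Hypothetical toy vectors standing in for a real semantic space.
toy_space = {
    "child":   [1.0, 2.0, 0.0],
    "chatter": [0.0, 1.0, 3.0],
    "gabble":  [0.0, 2.0, 2.0],
}


def add_vectors(u, v):
    """Simple additive composition of two word vectors."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


p1 = add_vectors(toy_space["child"], toy_space["chatter"])  # reference phrase
p2 = add_vectors(toy_space["child"], toy_space["gabble"])   # landmark phrase

model_score = cosine(p1, p2)
print(round(model_score, 2))
```

Repeating this for every pair gives the model's vector of cosines, which is then correlated with the vector of mean human ratings.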

---

**The minimum to do in this task**: compute the correlation between at least _ONE_ model and the human ratings. It should not be hard to run the same code for the other models as well. For examples of how to interpret the results, look at Section 5 (Results) of the original paper.

In [18]:
our_dataset = svdspace_10k  # change here to change the model used

In [19]:
#from the lecture notebook:
with open('./mitchell_lapata_acl08.txt', 'r') as f:
    phrase_dataset = f.read().splitlines()

for line in phrase_dataset[:10]:
    print(line)
    
# get all unique words
words = []
for line in phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in words:
        words.append(verb)
    if noun not in words:
        words.append(noun)
    if landmark not in words:
        words.append(landmark)

participant verb noun landmark input hilo
participant20 stray thought roam 7 low
participant20 stray discussion digress 6 high
participant20 stray eye roam 7 high
participant20 stray child digress 1 low
participant20 throb body pulse 5 high
participant20 throb head shudder 2 low
participant20 throb voice shudder 3 low
participant20 throb vein pulse 6 high
participant20 chatter machine click 4 high


In [20]:
# check that every word in our task dataset can be found in the semantic space
# (ideally this prints nothing; every word printed below is missing from the space)
to_remove = []
for w in words:
    if w not in our_dataset:
        print(w)
        to_remove.append(w)
# phrases that contain a missing word cannot be composed, so we ignore them later

stray
roam
digress
throb
pulse
shudder
vein
chatter
gabble
tooth
rebound
ricochet
optimism
flicker
waver
flick
subside
lessen
symptom
slump
slouch
stoop
erupt
burst
temper
flare
recoil
flinch
prosper
fluctuate
falter
cigarette
reel
whirl
stagger
glow
cigar


In [21]:
# cleaning the task dataset (we might call it phrase dataset from now on)
# we are removing all phrases which contain non-found words
# this would probably remove other words as well (those, which are paired with the non-found words)

cleaned_phrase_dataset = []
for line in phrase_dataset:
    _, verb, noun, landmark, _, _ = line.split()
    if verb in to_remove or noun in to_remove or landmark in to_remove:
        continue
    cleaned_phrase_dataset.append(line)

target_words = []
for line in cleaned_phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in target_words:
        target_words.append(verb)
    if noun not in target_words:
        target_words.append(noun)
    if landmark not in target_words:
        target_words.append(landmark)

From here on the code is our own:

In [22]:
phrase_dictionary = {}
for string in cleaned_phrase_dataset[1:]:
    participant, verb, noun, landmark_verb, human_input, hilo = string.split(' ')
    pair_1 = (noun, verb)
    pair_2 = (noun, landmark_verb)
    pairs = (pair_1, pair_2)
    
    if pairs not in phrase_dictionary:
        phrase_dictionary[pairs] = {'participant_count': 1, 'input_sum': int(human_input), 'hilo': hilo}
    else:
        phrase_dictionary[pairs]['participant_count'] += 1
        phrase_dictionary[pairs]['input_sum'] += int(human_input)
        
print(phrase_dictionary)

{(('butler', 'bow'), ('butler', 'submit')): {'participant_count': 34, 'input_sum': 100, 'hilo': 'low'}, (('company', 'bow'), ('company', 'submit')): {'participant_count': 34, 'input_sum': 147, 'hilo': 'high'}, (('sale', 'boom'), ('sale', 'thunder')): {'participant_count': 34, 'input_sum': 95, 'hilo': 'low'}, (('gun', 'boom'), ('gun', 'thunder')): {'participant_count': 34, 'input_sum': 191, 'hilo': 'high'}, (('head', 'bow'), ('head', 'submit')): {'participant_count': 26, 'input_sum': 85, 'hilo': 'low'}, (('government', 'bow'), ('government', 'submit')): {'participant_count': 26, 'input_sum': 140, 'hilo': 'high'}, (('noise', 'boom'), ('noise', 'thunder')): {'participant_count': 26, 'input_sum': 159, 'hilo': 'high'}, (('export', 'boom'), ('export', 'thunder')): {'participant_count': 26, 'input_sum': 72, 'hilo': 'low'}}


In [23]:
# here we define functions that represent the compositional models, with the weights based on what was found to be best in [2], 
# if applicable. We also discarded the Kintsch representation as we believe it needed some other variables (as far as we 
# understood it). 

def repr_multiply(word_tuple):
    w1, w2 = word_tuple
    noun = our_dataset[w1]
    verb = our_dataset[w2]
 
    representation = noun * verb
    
    return representation

def repr_add(word_tuple):
    w1, w2 = word_tuple
    noun = our_dataset[w1]
    verb = our_dataset[w2]
 
    representation = noun + verb
    
    return representation

def repr_weight_add(word_tuple):
    w1, w2 = word_tuple
    noun = our_dataset[w1]
    verb = our_dataset[w2]
 
    representation = (0.2 * noun) + (0.8 * verb)
    
    return representation

def repr_combined(word_tuple):
    w1, w2 = word_tuple
    noun = our_dataset[w1]
    verb = our_dataset[w2]
 
    representation = (0.0 * noun) + (0.95 * verb) + (0.05 * noun * verb)
    
    return representation

In [24]:
# here we construct a dictionary to store the values of similarity between representations for different compositional models
# and the average human evaluation

from scipy import spatial

model_pred_dict = {}

for k,v in phrase_dictionary.items():
    word_tuple_1, word_tuple_2 = k
    participant_count = v['participant_count']
    input_sum = v['input_sum']
    hilo = v['hilo']
    
    average_input = input_sum / participant_count
    
    # addition
    repr_add_1 = repr_add(word_tuple_1)
    repr_add_2 = repr_add(word_tuple_2)
    cosine_sim_add = 1 - spatial.distance.cosine(repr_add_1, repr_add_2)
    
    # multiplication
    repr_mult_1 = repr_multiply(word_tuple_1)
    repr_mult_2 = repr_multiply(word_tuple_2)
    cosine_sim_mult = 1 - spatial.distance.cosine(repr_mult_1, repr_mult_2)
    
    # weighted addition
    repr_wadd_1 = repr_weight_add(word_tuple_1)
    repr_wadd_2 = repr_weight_add(word_tuple_2)
    cosine_sim_wadd = 1 - spatial.distance.cosine(repr_wadd_1, repr_wadd_2)
    
    # combined
    repr_comb_1 = repr_combined(word_tuple_1)
    repr_comb_2 = repr_combined(word_tuple_2)
    cosine_sim_comb = 1 - spatial.distance.cosine(repr_comb_1, repr_comb_2)
    
    model_pred_dict[k] = {
        'average_input': average_input, 'addition': cosine_sim_add, 'multiplication': cosine_sim_mult, 
        'weighted_addition': cosine_sim_wadd, 'combined': cosine_sim_comb, 'hilo': hilo
                         }
    
print(model_pred_dict)

{(('butler', 'bow'), ('butler', 'submit')): {'average_input': 2.9411764705882355, 'addition': 0.8366040642896164, 'multiplication': 0.8751666899417615, 'weighted_addition': 0.5624267039698349, 'combined': 0.24661793239357221, 'hilo': 'low'}, (('company', 'bow'), ('company', 'submit')): {'average_input': 4.323529411764706, 'addition': 0.9304615493700634, 'multiplication': 0.9273976875291448, 'weighted_addition': 0.6734756630177691, 'combined': 0.132451617886431, 'hilo': 'high'}, (('sale', 'boom'), ('sale', 'thunder')): {'average_input': 2.7941176470588234, 'addition': 0.8776046136128655, 'multiplication': 0.975175543949629, 'weighted_addition': 0.6847917067605415, 'combined': 0.33894277055275446, 'hilo': 'low'}, (('gun', 'boom'), ('gun', 'thunder')): {'average_input': 5.617647058823529, 'addition': 0.9229430503489647, 'multiplication': 0.9740963478239069, 'weighted_addition': 0.7268572088527075, 'combined': 0.3356509207630427, 'hilo': 'high'}, (('head', 'bow'), ('head', 'submit')): {'av

In [25]:
# here we split the results into high and low similarity lists and also gather them all in lists where the order is relevant,
# so they can be compared using  Spearman correlation.

model_names = ['addition', 'multiplication', 'weighted_addition', 'combined']
list_keys = ['average'] + model_names

high_pairs = {key: [] for key in list_keys}
low_pairs = {key: [] for key in list_keys}
full_lists = {key: [] for key in list_keys}

for pair, scores in model_pred_dict.items():
    # pick the group of lists that matches the pair's similarity label
    group = high_pairs if scores['hilo'] == 'high' else low_pairs

    # average human rating, stored under 'average'
    group['average'].append(scores['average_input'])
    full_lists['average'].append(scores['average_input'])

    # cosine similarity from each compositional model
    for name in model_names:
        group[name].append(scores[name])
        full_lists[name].append(scores[name])
        
print(high_pairs)
print(low_pairs)
print(full_lists)      

{'average': [4.323529411764706, 5.617647058823529, 5.384615384615385, 6.115384615384615], 'addition': [0.9304615493700634, 0.9229430503489647, 0.931689728059706, 0.9219236351707659], 'multiplication': [0.9273976875291448, 0.9740963478239069, 0.8671235911741174, 0.9713075958327421], 'weighted_addition': [0.6734756630177691, 0.7268572088527075, 0.6779138438287566, 0.7335650252383998], 'combined': [0.132451617886431, 0.3356509207630427, 0.13071763918235635, 0.4065914590057804]}
{'average': [2.9411764705882355, 2.7941176470588234, 3.269230769230769, 2.769230769230769], 'addition': [0.8366040642896164, 0.8776046136128655, 0.8747165074822579, 0.9174691846974397], 'multiplication': [0.8751666899417615, 0.975175543949629, 0.9398637212790923, 0.9361078822218509], 'weighted_addition': [0.5624267039698349, 0.6847917067605415, 0.6163648369816668, 0.7159891091461872], 'combined': [0.24661793239357221, 0.33894277055275446, 0.22380798301971083, 0.36278825807688875]}
{'average': [2.9411764705882355, 4

In [26]:
from scipy import stats  # provides spearmanr, used below

types = ['addition', 'multiplication', 'weighted_addition', 'combined', 'average']
scores_list = []

for name in types:
    high = high_pairs[name]
    low = low_pairs[name]
    full = full_lists[name]
    average = full_lists['average']
    
    average_high = sum(high) / len(high)
    average_low = sum(low) / len(low)
       
    rho, pval = stats.spearmanr(full, average)

    score_list = [name, round(average_high, 2), round(average_low, 2), round(rho, 2), round(pval, 2)]
    scores_list.append(score_list)  

In [27]:
# this part is done just to print it out in a pretty format

import pandas as pd
import numpy as np

print(pd.DataFrame(np.array(scores_list), columns=["Model", "High", "Low", "Rho", "p-score"]))

               Model  High   Low   Rho p-score
0           addition  0.93  0.88  0.57    0.14
1     multiplication  0.93  0.93  0.05    0.91
2  weighted_addition   0.7  0.64   0.4    0.32
3           combined  0.25  0.29  -0.1    0.82
4            average  5.36  2.94   1.0     0.0


**Any comments/thoughts should go here:** (Maria) I was really surprised to see that these results do not align at all with the ones from the Mitchell and Lapata paper, but I assume this is because so few pairs (only 8) were left to compare after removing the phrases with missing words. Clearly, the combined method works the worst here and, surprisingly, addition and weighted addition work the best. I am not sure I did everything right, though, so it will be good to compare my results with those of my classmates.

# Literature

  - [1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25 2014. Association for Computational Linguistics.  

  - [2] J. Mitchell and M. Lapata. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244, 2008. Association for Computational Linguistics.
  
  - [3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

## Marks

This assignment has a total of 60 marks.