# Week 7, Lesson 4, Activity 7: Evaluation tests for semantic representations

&copy;2021, Ekaterina Kochmar

Your task in this activity is to:

- Implement an analogy solver algorithm and test word embeddings on this task.

## Step 1: Familiarise yourself with word embeddings via spaCy

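Let's first access word vectors and measure semantic similarity between words. In `spaCy`, each token exposes its embedding through the `.vector` attribute and can be compared with another token through the `.similarity()` method. The cell below assumes that the `en_core_web_md` model is installed (for example, via `python -m spacy download en_core_web_md`).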

Let's start by measuring similarity between some sample words (feel free to experiment with your own). What do these similarities suggest about the meanings of the words? Which words are most similar to each other?

In [None]:
import spacy
nlp = spacy.load('en_core_web_md')

text = u'cat dog apple orange pasta pizza coffee tea'
words = nlp(text)

print("\t" + text.replace(" ", "\t"))

# Print out word similarities in a table
for word1 in words:
    output = str(word1) + "\t"
    for word2 in words:
        output += str(round(word1.similarity(word2), 4)) + "\t"
    print(output)

## Step 2: Implement a word analogy task

Now, let's try to code the word analogy task and see if our algorithm can come up with a solution similar to the one presented in Mikolov et al.'s [paper](https://arxiv.org/pdf/1310.4546.pdf).

That is, our analogy task will encode the relation `country:capital`, but the computer won't be explicitly told that this is the relation to use. Instead, we'll ask a question: "*Russia is to Moscow as China is to what?*" (as usual, feel free to insert your own variants).

Let's first provide the list of countries and capitals in alphabetical order. To mix things up a bit, let's add some countries (e.g., *Switzerland* and *Brazil*) with no corresponding capitals on the list, some capitals (e.g., *Amsterdam* and *London*) with no corresponding countries, and some cities (e.g., *Barcelona* and *Venice*) that are not capitals. You can always check whether the model has a vector for a word by printing out part of the word vector (always a good idea to check the data you are working with!):


In [None]:
text = u'Amsterdam Ankara Athens Australia Barcelona Beijing Berlin Brazil Chicago China '
text += u'France Germany Greece Italy Japan Lisbon London Madrid Moscow Paris '
text += u'Poland Portugal Rome Russia Spain Switzerland Tokyo Turkey Venice Warsaw '
words = nlp(text)

for word in words:
    # has_vector reports whether the model has a vector for this token
    print(word, word.has_vector)
    print(word.vector[:5])

To measure similarity, you will need a cosine similarity function, i.e. the dot product of the two vectors divided by the product of their lengths:

In [None]:
import numpy as np
import math

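# Implement cosine similarity (one straightforward way to do it)
def cosine(vec1, vec2):
    if len(vec1) != len(vec2):
        return 0.0
    num = 0.0  # dot product
    vec1_len = 0.0  # squared length of vec1 (square root taken below)
    vec2_len = 0.0  # squared length of vec2 (square root taken below)
    # accumulate the dot product and the squared lengths
    for x, y in zip(vec1, vec2):
        num += x * y
        vec1_len += x * x
        vec2_len += y * y
    # a zero vector has no direction, so treat its similarity as 0
    if vec1_len == 0.0 or vec2_len == 0.0:
        return 0.0
    return num / (math.sqrt(vec1_len) * math.sqrt(vec2_len))  # cosine value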
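
The cell above imports NumPy, which allows a more compact version of the same computation. The sketch below is one possible vectorised equivalent rather than the required solution; the name `cosine_np` and the toy input vectors are made up for illustration.

```python
import numpy as np

def cosine_np(vec1, vec2):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    # Mirror the length check of the loop-based version
    if v1.shape != v2.shape:
        return 0.0
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    # A zero vector has no direction, so report 0 similarity
    if denom == 0.0:
        return 0.0
    return float(np.dot(v1, v2) / denom)

print(cosine_np([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine_np([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_np([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Vectors pointing the same way should score close to 1, orthogonal vectors 0, and opposite vectors close to -1, which makes for a quick sanity check of any cosine implementation.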

Now you are all set to try out the analogy task.

In [None]:
question = u"Russia is to Moscow as China is to WHAT?"
text = nlp(question)
source1 = text[0]
source2 = text[3]
target1 = text[5]

max_sim = 0.0
target2 = "N/A"

# Apply the operations on vectors: vector(source2) - vector(source1) + vector(target1)
target2_vector = source2.vector - source1.vector + target1.vector

#Find the word with the most similar vector to the result
for word in words:
    if not (str(word)==str(target1) or str(word)==str(source1) or str(word)==str(source2)):
        current_sim = cosine(target2_vector, word.vector)
        if current_sim >= max_sim:
            max_sim = current_sim 
            target2 = word

print(question)
print(target2)
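
The vector arithmetic behind the analogy can also be checked without loading a model. The following is a minimal sketch on hand-made toy vectors: the five dimensions, the `toy_vectors` dictionary, and the `solve_analogy` helper are all assumptions made for illustration and are not part of spaCy or of the original activity.

```python
import math

def cosine(vec1, vec2):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(vec1, vec2))
    len1 = math.sqrt(sum(x * x for x in vec1))
    len2 = math.sqrt(sum(y * y for y in vec2))
    if len1 == 0.0 or len2 == 0.0:
        return 0.0
    return dot / (len1 * len2)

# Made-up vectors: [is_country, is_capital, region_1, region_2, region_3]
toy_vectors = {
    "Russia":  [1.0, 0.0, 1.0, 0.0, 0.0],
    "Moscow":  [0.0, 1.0, 1.0, 0.0, 0.0],
    "China":   [1.0, 0.0, 0.0, 1.0, 0.0],
    "Beijing": [0.0, 1.0, 0.0, 1.0, 0.0],
    "Japan":   [1.0, 0.0, 0.0, 0.0, 1.0],
    "Tokyo":   [0.0, 1.0, 0.0, 0.0, 1.0],
}

def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three question words, as in the loop above
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(solve_analogy("Russia", "Moscow", "China", toy_vectors))  # Beijing
print(solve_analogy("Russia", "Moscow", "Japan", toy_vectors))  # Tokyo
```

In these toy vectors the `country:capital` relation is a single clean offset, so the method always succeeds. Real embeddings encode the relation only approximately, which is why the results on the full word list are worth inspecting.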

Finally, let's run the task on all countries:

In [None]:
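# Define the analogy task as a separate function
# Note that the code below is almost exactly the same as in the previous cell
def analogy_task(country):
    question = u"Russia is to Moscow as " + country
    text = nlp(question)
    source1 = text[0]
    source2 = text[3]
    target1 = text[5]

    max_sim = 0.0
    target2 = "N/A"

    # vector(source2) - vector(source1) + vector(target1)
    target2_vector = source2.vector - source1.vector + target1.vector

    # Find the word with the most similar vector to the result
    for word in words:
        if not (str(word)==str(target1) or str(word)==str(source1) or str(word)==str(source2)):
            current_sim = cosine(target2_vector, word.vector)
            if current_sim >= max_sim:
                max_sim = current_sim
                target2 = word

    print(question)
    print("\t is to " + str(target2))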

countries = ["China", "France", "Germany", "Greece", "Italy", 
             "Japan", "Poland", "Portugal", "Spain", "Turkey"]

for country in countries:
    analogy_task(country)

Do the results match the actual capitals?
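
One way to answer this systematically is to compare the predictions with a gold-standard list and compute accuracy. The sketch below assumes the predictions have been collected into a dictionary; the `accuracy` helper is hypothetical, and the `predictions` shown are placeholders made up for illustration, not real model output.

```python
# Gold-standard capitals for the ten countries queried above
gold = {
    "China": "Beijing", "France": "Paris", "Germany": "Berlin",
    "Greece": "Athens", "Italy": "Rome", "Japan": "Tokyo",
    "Poland": "Warsaw", "Portugal": "Lisbon", "Spain": "Madrid",
    "Turkey": "Ankara",
}

def accuracy(predictions, gold):
    """Fraction of countries whose predicted capital matches the gold answer."""
    correct = sum(1 for country, capital in gold.items()
                  if predictions.get(country) == capital)
    return correct / len(gold)

# Placeholder predictions for illustration only, NOT real model output:
# all answers are copied from gold, then two are made deliberately wrong
predictions = dict(gold)
predictions["Spain"] = "Barcelona"
predictions["Italy"] = "Venice"

print(accuracy(predictions, gold))  # 8 of 10 correct
```

To use this with the notebook code, `analogy_task` would need to return `str(target2)` in addition to printing it, so that the answers can be stored per country.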

**Optional:** Apply the analogy task to pairs of words linked by other types of relations. For inspiration, consider the examples from Mikolov et al.'s [paper](https://arxiv.org/pdf/1301.3781.pdf).