# Cross-Language Word Embeddings

We have mentioned, and will discuss in more detail this week, how we can reduce the dimensionality of word representations from their original vector space to an embedding space with a few hundred dimensions. Different modeling choices for word embeddings may ultimately be evaluated by the effectiveness of classifiers, parsers, and other inference models that use those embeddings.

In this assignment, however, we will consider another common method of evaluating word embeddings: by judging the usefulness of pairwise distances between words in the embedding space.

Follow along with the examples in this notebook, and implement the sections of code flagged with **TODO**.

In [2]:
import gensim
import numpy as np
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

We'll start by downloading a plain-text version of the Shakespeare plays we used for the first assignment.

In [3]:
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/shakespeare_plays.txt
lines = [s.split() for s in open('shakespeare_plays.txt')]

--2021-04-16 23:01:55--  http://www.ccs.neu.edu/home/dasmith/courses/cs6120/shakespeare_plays.txt
Resolving www.ccs.neu.edu (www.ccs.neu.edu)... 52.70.229.197
Connecting to www.ccs.neu.edu (www.ccs.neu.edu)|52.70.229.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4746840 (4.5M) [text/plain]
Saving to: ‘shakespeare_plays.txt’


2021-04-16 23:01:55 (68.8 MB/s) - ‘shakespeare_plays.txt’ saved [4746840/4746840]



Then, we'll estimate a simple word2vec model on the Shakespeare texts.

In [4]:
model = Word2Vec(lines)

Even with such a small training set size, you can perform some standard analogy tasks.

In [5]:
model.wv.most_similar(positive=['king','woman'], negative=['man'])

[('queen', 0.8016893863677979),
 ('prince', 0.7813111543655396),
 ('duke', 0.7290380001068115),
 ('warwick', 0.7129981517791748),
 ('york', 0.7128998041152954),
 ('clarence', 0.7102699875831604),
 ('princess', 0.70278400182724),
 ('son', 0.6930181980133057),
 ('gloucester', 0.6921982765197754),
 ('suffolk', 0.6908835172653198)]
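
The analogy query above can be read as vector arithmetic: add the vectors for `king` and `woman`, subtract the vector for `man`, and look for the nearest remaining word. The following is a minimal sketch of that idea on hand-made two-dimensional vectors. The `toy` dictionary and the `analogy` helper are hypothetical names introduced for illustration, and gensim's own implementation may differ in detail.

```python
import numpy as np

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    'king':   np.array([1.0,  1.0]),
    'queen':  np.array([1.0, -1.0]),
    'man':    np.array([0.0,  1.0]),
    'woman':  np.array([0.0, -1.0]),
    'prince': np.array([0.8,  0.9]),
}

def analogy(positive, negative, vectors):
    # Combine the query vectors: add the positives, subtract the negatives.
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)

    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Rank every word that is not part of the query by cosine similarity.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cos(target, vectors[w]))

print(analogy(['king', 'woman'], ['man'], toy))
```

Excluding the query words from the candidates matters: otherwise `king` itself would often be the closest word to the combined vector.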

For the rest of this assignment, we will focus on finding words with similar embeddings, both within and across languages. For example, what words are similar to the name of the title character of *Othello*?

In [6]:
model.wv.most_similar(positive=['othello'])
#model.wv.most_similar(positive=['brutus'])

[('desdemona', 0.961108386516571),
 ('iago', 0.9302796125411987),
 ('cleopatra', 0.9296098947525024),
 ('rosalind', 0.9256273508071899),
 ('imogen', 0.9237173795700073),
 ('cressida', 0.9197252988815308),
 ('fal', 0.9162404537200928),
 ('ham', 0.9103072881698608),
 ('jul', 0.900834321975708),
 ('pisanio', 0.8976864814758301)]

This search uses cosine similarity. Calling the model's `similarity` method directly should give the same similarity between the words `othello` and `desdemona` as in the search results above.

In [7]:
model.wv.similarity('othello', 'desdemona')

0.9611084

**TODO**: Your **first task**, therefore, is to implement your own cosine similarity function so that you can reuse it outside of the context of the gensim model object.

In [8]:
## TODO: Implement cosim
def cosim(v1, v2):
  ## return cosine similarity between v1 and v2
  return np.dot(v1, v2)/(np.dot(v1, v1) * np.dot(v2, v2))**0.5

## This should give a result similar to model.wv.similarity:
cosim(model.wv['othello'], model.wv['desdemona'])

0.9611083708021116
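
Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms, so it is 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for opposite vectors. The sketch below checks those three cases. The name `cosim_checked` is hypothetical, and returning 0 for a zero vector is an assumed convention, not something the assignment specifies.

```python
import numpy as np

def cosim_checked(v1, v2):
    # Cosine similarity = dot product divided by the product of Euclidean norms.
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    # Assumed convention: a zero vector has similarity 0 with everything.
    if denom == 0:
        return 0.0
    return float(np.dot(v1, v2) / denom)

a = np.array([1.0, 2.0, 3.0])
print(cosim_checked(a, 2 * a))                                     # parallel vectors
print(cosim_checked(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # orthogonal vectors
print(cosim_checked(a, -a))                                        # opposite vectors
```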

## Evaluation

We could collect a lot of human judgments about how similar pairs of words, or pairs of Shakespearean characters, are. Then we could compare different word-embedding models by their ability to replicate these human judgments.

If we extend our ambition to multiple languages, however, we can use a word translation task to evaluate word embeddings.

We will use a subset of [Facebook AI's FastText cross-language embeddings](https://fasttext.cc/docs/en/aligned-vectors.html) for several languages. Your task will be to compare English both to French, and to *one more language* from the following set: Arabic, German, Portuguese, Russian, Spanish, Vietnamese, and Chinese.

In [9]:
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.en.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.fr.vec

# TODO: uncomment at least one of these to work with another language
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.ar.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.de.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.pt.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.ru.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.es.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.vi.vec
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.zh.vec

--2021-04-16 23:02:28--  http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.en.vec
Resolving www.ccs.neu.edu (www.ccs.neu.edu)... 52.70.229.197
Connecting to www.ccs.neu.edu (www.ccs.neu.edu)|52.70.229.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 67681172 (65M)
Saving to: ‘30k.en.vec’


2021-04-16 23:02:29 (293 MB/s) - ‘30k.en.vec’ saved [67681172/67681172]

--2021-04-16 23:02:29--  http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.fr.vec
Resolving www.ccs.neu.edu (www.ccs.neu.edu)... 52.70.229.197
Connecting to www.ccs.neu.edu (www.ccs.neu.edu)|52.70.229.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 67802327 (65M)
Saving to: ‘30k.fr.vec’


2021-04-16 23:02:29 (235 MB/s) - ‘30k.fr.vec’ saved [67802327/67802327]

--2021-04-16 23:02:29--  http://www.ccs.neu.edu/home/dasmith/courses/cs6120/30k.ar.vec
Resolving www.ccs.neu.edu (www.ccs.neu.edu)... 52.70.229.197
Connecting to www.ccs.neu.edu (www.ccs.neu.edu)|52.70.229.

We'll start by loading the word vectors from their textual file format to a dictionary mapping words to numpy arrays.

In [10]:
def vecref(s):
  (word, srec) = s.split(' ', 1)
  return (word, np.fromstring(srec, sep=' '))

def ftvectors(fname):
  return { k:v for (k, v) in [vecref(s) for s in open(fname)] if len(v) > 1} 

envec = ftvectors('30k.en.vec')
frvec = ftvectors('30k.fr.vec')

# TODO: load vectors for one more language, such as zhvec (Chinese)
arvec = ftvectors('30k.ar.vec')
devec = ftvectors('30k.de.vec')
ptvec = ftvectors('30k.pt.vec')
ruvec = ftvectors('30k.ru.vec')
esvec = ftvectors('30k.es.vec')
vivec = ftvectors('30k.vi.vec')
zhvec = ftvectors('30k.zh.vec')
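
The `.vec` text format starts with a header line giving the vocabulary size and the vector dimension, followed by one line per word: the word, then its components separated by spaces. The header parses to a single number, which is why the `len(v) > 1` filter above drops it. The sketch below shows the same parsing on a made-up in-memory sample. `parse_vec_lines` and the sample contents are hypothetical, and it uses `np.array` on the split fields rather than `np.fromstring`.

```python
import io
import numpy as np

# Made-up stand-in for a .vec file: a "count dimension" header, then one word per line.
sample = io.StringIO("2 3\nchat 0.1 0.2 0.3\nchien 0.4 0.5 0.6\n")

def parse_vec_lines(lines):
    vectors = {}
    for line in lines:
        word, rest = line.rstrip('\n').split(' ', 1)
        vec = np.array(rest.split(), dtype=float)
        # The header parses to a single number, so this check skips it.
        if len(vec) > 1:
            vectors[word] = vec
    return vectors

toy_vecs = parse_vec_lines(sample)
print(sorted(toy_vecs), toy_vecs['chat'].shape)
```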

**TODO**: Your next task is to write a simple function that takes a vector and a dictionary of vectors and finds the most similar item in the dictionary. For this assignment, a linear scan through the dictionary using your `cosim` function from above is acceptable.

In [11]:
## TODO: implement this search function
def mostSimilar(vec, vecDict):
  ## Use your cosim function from above
  bestWord = ''
  # Cosine similarity can be negative, so start below any possible value.
  bestSimilarity = -np.inf
  for key in vecDict:
    similarity = cosim(vec, vecDict[key])
    if similarity > bestSimilarity:
      bestWord = key
      bestSimilarity = similarity
  return (bestWord, bestSimilarity)

## some example searches
print([mostSimilar(envec[e], frvec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])
print([mostSimilar(envec[e], devec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])
print([mostSimilar(envec[e], zhvec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])
print([mostSimilar(envec[e], arvec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])
print([mostSimilar(envec[e], ptvec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])
print([mostSimilar(envec[e], ruvec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])
print([mostSimilar(envec[e], esvec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])
print([mostSimilar(envec[e], vivec) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']])

[('informatique', 0.5023827767603763), ('allemagne', 0.5937184138759639), ('matrice', 0.5088361302065516), ('physique', 0.4555543434796394), ('fermentation', 0.3504105196166514)]
[('computer', 0.5037721476432345), ('deutschland', 0.4705668805911338), ('matrix', 0.5157118468900325), ('physik', 0.5837534244545665), ('enzyme', 0.2826869534747046)]
[('電腦', 0.6331072804288355), ('德國', 0.6117215949997674), ('矩陣', 0.4826503662879594), ('物理', 0.539459807891487), ('酵母', 0.5094100865393028)]
[('الحاسوب', 0.4701278935881928), ('ألمانيا', 0.5330396598131463), ('مصفوفة', 0.34179170709793727), ('الفيزياء', 0.5058640608484845), ('بروتين', 0.2576853070061513)]
[('computador', 0.4988965701007087), ('alemanha', 0.6288992279001664), ('matrix', 0.4204704252830658), ('astrofísica', 0.5025296240252473), ('fermentação', 0.40297918812091316)]
[('компьютер', 0.40957792244343183), ('германия', 0.5066204406798313), ('матрица', 0.39972205911502523), ('физики', 0.532605548844492), ('белков', 0.25252585117716175)]


Some matches make more sense than others. Note that `computer` most closely matches `informatique`, the French term for *computer science*; further down the ranked list you would find `ordinateur`, the term for *computer*. This illustrates one weakness of embedding word *types* independent of context: a single vector per word has to cover all of that word's uses.
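
To look further down the list, rank all candidates rather than keep only the best one. A minimal sketch, assuming the vectors fit in memory as one matrix: normalize every row and the query, so that a single matrix-vector product yields all cosine similarities at once. The function name `top_k` and the toy vectors are hypothetical and are not taken from the FastText files.

```python
import numpy as np

def top_k(vec, vec_dict, k=3):
    words = list(vec_dict)
    matrix = np.stack([vec_dict[w] for w in words])
    # Normalize rows and the query so that dot products are cosine similarities.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = vec / np.linalg.norm(vec)
    sims = matrix @ query
    # Indices of the k largest similarities, best first.
    best = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in best]

# Hypothetical toy target-language vectors.
toy_fr = {
    'informatique': np.array([1.0, 0.1]),
    'ordinateur':   np.array([0.9, 0.3]),
    'levure':       np.array([0.0, 1.0]),
}
print(top_k(np.array([1.0, 0.0]), toy_fr, k=2))
```

Building the normalized matrix once per language and reusing it across queries avoids repeating the normalization for every search.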

To evaluate cross-language embeddings more broadly, we'll look at a dataset of links between Wikipedia articles.

In [12]:
!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6120/links.tab
links = [s.split() for s in open('links.tab')]

--2021-04-16 23:03:08--  http://www.ccs.neu.edu/home/dasmith/courses/cs6120/links.tab
Resolving www.ccs.neu.edu (www.ccs.neu.edu)... 52.70.229.197
Connecting to www.ccs.neu.edu (www.ccs.neu.edu)|52.70.229.197|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1408915 (1.3M)
Saving to: ‘links.tab’


2021-04-16 23:03:08 (28.2 MB/s) - ‘links.tab’ saved [1408915/1408915]



This `links` variable consists of triples of `(English term, language, term in that language)`. For example, here is the link between English `academy` and French `académie`:

In [13]:
links[302]

['academy', 'fr', 'académie']

**TODO**: Evaluate the English and French embeddings by computing the proportion of English Wikipedia articles whose corresponding French article is also the closest word in embedding space. Skip English articles not covered by the word embedding dictionary. Since many articles, e.g., those about named entities, have the same title in English and French, also compute the baseline accuracy achieved by simply echoing the English title as if it were French. Remember to iterate only over English Wikipedia articles, not the entire embedding dictionary.

In [15]:
## TODO: Compute English-French Wikipedia retrieval accuracy.
accuracy = 0
baselineAccuracy = 0

# Calculate the baseline accuracy of French
words_lang = [link for link in links if link[1] == 'fr']
n = len(words_lang)
baselineAccuracy = sum(word[0] == word[2] for word in words_lang)/n
print(baselineAccuracy)

# calculate the accuracy of French
count = 0
for word in words_lang:
  similar_word, similarity = mostSimilar(envec[word[0]], frvec)
  if similar_word == word[2]:
    count += 1
accuracy = count/n
print(accuracy)

0.6742324450298915
0.5359205593271862
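
The same evaluation can be wrapped in one function that takes a language code, which avoids repeating the cell for each language. The following is a sketch under stated assumptions: every row of the links list has exactly three fields, and English terms without an embedding are skipped. The helper names `cosine`, `nearest`, and `evaluate` and all of the toy data are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest(vec, vec_dict):
    # Linear scan: the key whose vector has the highest cosine similarity.
    return max(vec_dict, key=lambda w: cosine(vec, vec_dict[w]))

def evaluate(links, lang, src_vecs, tgt_vecs):
    # Keep only links for this language whose English term has an embedding.
    pairs = [(en, tgt) for (en, code, tgt) in links
             if code == lang and en in src_vecs]
    if not pairs:
        return (0.0, 0.0)
    # Baseline: echo the English title as if it were the target-language title.
    baseline = sum(en == tgt for en, tgt in pairs) / len(pairs)
    accuracy = sum(nearest(src_vecs[en], tgt_vecs) == tgt
                   for en, tgt in pairs) / len(pairs)
    return (baseline, accuracy)

# Hypothetical toy data standing in for links.tab and the .vec dictionaries.
toy_links = [['academy', 'fr', 'académie'], ['paris', 'fr', 'paris'],
             ['yeast', 'fr', 'levure'], ['academy', 'de', 'akademie']]
toy_en = {'academy': np.array([1.0, 0.0]), 'paris': np.array([0.0, 1.0])}
toy_fr = {'académie': np.array([0.9, 0.1]), 'paris': np.array([0.1, 0.9])}
print(evaluate(toy_links, 'fr', toy_en, toy_fr))
```

In the toy data, `yeast` is dropped because it has no English vector, so both proportions are computed over the two remaining French links.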


**TODO**: Compute accuracy and baseline (identity function) accuracy for English and another language besides French. Although the baseline will be lower for languages not written in the Roman alphabet (e.g., Arabic or Chinese), there are still many articles in those languages with headwords written in Roman characters.

In [16]:
## TODO: Compute English-German Wikipedia retrieval accuracy.
accuracy_1 = 0
baselineAccuracy_1 = 0

# Calculate the baseline accuracy of German
words_lang_1 = [link for link in links if link[1] == 'de']
n = len(words_lang_1)
baselineAccuracy_1 = sum(word[0] == word[2] for word in words_lang_1)/n
print(baselineAccuracy_1)

# calculate the accuracy of German
count_1 = 0
for word in words_lang_1:
  similar_word, similarity = mostSimilar(envec[word[0]], devec)
  if similar_word == word[2]:
    count_1 += 1

accuracy_1 = count_1/n
print(accuracy_1)


0.6652764220427722
0.36959801224598454


In [17]:
## TODO: Compute English-Chinese Wikipedia retrieval accuracy.
accuracy_2 = 0
baselineAccuracy_2 = 0

# Calculate the baseline accuracy of Chinese
words_lang_2 = [link for link in links if link[1] == 'zh']
n = len(words_lang_2)
baselineAccuracy_2 = sum(word[0] == word[2] for word in words_lang_2)/n
print(baselineAccuracy_2)

# calculate the accuracy of Chinese
count_2 = 0
for word in words_lang_2:
  similar_word, similarity = mostSimilar(envec[word[0]], zhvec)
  if similar_word == word[2]:
    count_2 += 1

accuracy_2 = count_2/n
print(accuracy_2)


0.06740602045225406
0.13639637044505257


In [18]:
## TODO: Compute English-Arabic Wikipedia retrieval accuracy.
accuracy_3 = 0
baselineAccuracy_3 = 0

# Calculate the baseline accuracy of Arabic
words_lang_3 = [link for link in links if link[1] == 'ar']
n = len(words_lang_3)
baselineAccuracy_3 = sum(word[0] == word[2] for word in words_lang_3)/n
print(baselineAccuracy_3)

# calculate the accuracy of Arabic
count_3 = 0
for word in words_lang_3:
  similar_word, similarity = mostSimilar(envec[word[0]], arvec)
  if similar_word == word[2]:
    count_3 += 1

accuracy_3 = count_3/n
print(accuracy_3)

0.006582155046318869
0.2067284251584593


In [19]:
## TODO: Compute English-Portuguese Wikipedia retrieval accuracy.
accuracy_4 = 0
baselineAccuracy_4 = 0

# Calculate the baseline accuracy of Portuguese
words_lang_4 = [link for link in links if link[1] == 'pt']
n = len(words_lang_4)
baselineAccuracy_4 = sum(word[0] == word[2] for word in words_lang_4)/n
print(baselineAccuracy_4)

# calculate the accuracy of Portuguese
count_4 = 0
for word in words_lang_4:
  similar_word, similarity = mostSimilar(envec[word[0]], ptvec)
  if similar_word == word[2]:
    count_4 += 1

accuracy_4 = count_4/n
print(accuracy_4)

0.5266257963697559
0.49465079937492484


In [20]:
## TODO: Compute English-Russian Wikipedia retrieval accuracy.
accuracy_5 = 0
baselineAccuracy_5 = 0

# Calculate the baseline accuracy of Russian
words_lang_5 = [link for link in links if link[1] == 'ru']
n = len(words_lang_5)
baselineAccuracy_5 = sum(word[0] == word[2] for word in words_lang_5)/n
print(baselineAccuracy_5)

# calculate the accuracy of Russian
count_5 = 0
for word in words_lang_5:
  similar_word, similarity = mostSimilar(envec[word[0]], ruvec)
  if similar_word == word[2]:
    count_5 += 1

accuracy_5 = count_5/n
print(accuracy_5)

0.07121905432359497
0.19183386131643787


In [21]:
## TODO: Compute English-Spanish Wikipedia retrieval accuracy.
accuracy_6 = 0
baselineAccuracy_6 = 0

# Calculate the baseline accuracy of Spanish
words_lang_6 = [link for link in links if link[1] == 'es']
n = len(words_lang_6)
baselineAccuracy_6 = sum(word[0] == word[2] for word in words_lang_6)/n
print(baselineAccuracy_6)

# calculate the accuracy of Spanish
count_6 = 0
for word in words_lang_6:
  similar_word, similarity = mostSimilar(envec[word[0]], esvec)
  if similar_word == word[2]:
    count_6 += 1

accuracy_6 = count_6/n
print(accuracy_6)

0.5173403193612774
0.5432884231536926


In [22]:
## TODO: Compute English-Vietnamese Wikipedia retrieval accuracy.
accuracy_7 = 0
baselineAccuracy_7 = 0

# Calculate the baseline accuracy of Vietnamese
words_lang_7 = [link for link in links if link[1] == 'vi']
n = len(words_lang_7)
baselineAccuracy_7 = sum(word[0] == word[2] for word in words_lang_7)/n
print(baselineAccuracy_7)

# calculate the accuracy of Vietnamese
count_7 = 0
for word in words_lang_7:
  similar_word, similarity = mostSimilar(envec[word[0]], vivec)
  if similar_word == word[2]:
    count_7 += 1

accuracy_7 = count_7/n
print(accuracy_7)

0.7713297463489623
0.5780169100691775


Further evaluation, if you are interested, could involve looking at the $k$ nearest neighbors of each English term to compute "recall at 10" or "mean reciprocal rank at 10".
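
As a rough illustration of those two metrics: recall at $k$ is the fraction of queries whose correct translation appears among the top $k$ neighbors, and mean reciprocal rank at $k$ averages $1/\text{rank}$ of the correct translation, counting 0 when it falls outside the top $k$. The sketch below assumes you already have a ranked neighbor list for each English term. The function names and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_lists, gold, k=10):
    # Fraction of queries whose gold answer appears among the top k candidates.
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_lists, gold))
    return hits / len(gold)

def mrr_at_k(ranked_lists, gold, k=10):
    # Average of 1/rank of the gold answer, counting 0 when it is not in the top k.
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        top = ranked[:k]
        if g in top:
            total += 1.0 / (top.index(g) + 1)
    return total / len(gold)

# Hypothetical ranked neighbor lists for three English queries.
ranked = [['informatique', 'ordinateur'],
          ['allemagne', 'autriche'],
          ['fermentation', 'enzyme']]
gold = ['ordinateur', 'allemagne', 'levure']
print(recall_at_k(ranked, gold, k=2))
print(mrr_at_k(ranked, gold, k=2))
```

Unlike top-1 accuracy, both metrics give partial credit to a case like `computer`, where the correct translation is near the top of the list but not first.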