# Translation Matrix Tutorial

## What is it?

Suppose we are given a set of word pairs and their associated vector representations $\{x_{i},z_{i}\}_{i=1}^{n}$, where $x_{i} \in R^{d_{1}}$ is the distributed representation of word $i$ in the source language, and $z_{i} \in R^{d_{2}}$ is the vector representation of its translation. Our goal is to find a transformation matrix $W$ such that $Wx_{i}$ approximates $z_{i}$. In practice, $W$ can be learned by solving the following optimization problem:

<center>$\min \limits_{W} \sum \limits_{i=1}^{n} ||Wx_{i}-z_{i}||^{2}$</center>
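
The objective above is an ordinary least-squares problem, so it can be solved in closed form. The following is a minimal NumPy sketch on synthetic data, not the gensim implementation; `fit_translation_matrix` is a hypothetical helper name, and the random vectors stand in for real word embeddings.

```python
import numpy as np


def fit_translation_matrix(X, Z):
    """Return W (d2 x d1) minimizing sum_i ||W x_i - z_i||^2.

    X: (n, d1) array of source vectors, one row per word.
    Z: (n, d2) array of target vectors; row i is the translation of row i of X.
    """
    # In row form the objective is min ||X W^T - Z||_F^2, which lstsq solves directly.
    W_transposed, _, _, _ = np.linalg.lstsq(X, Z, rcond=None)
    return W_transposed.T


# Synthetic check (assumption: random vectors replace real embeddings).
rng = np.random.default_rng(0)
d1, d2, n = 4, 3, 50
W_true = rng.normal(size=(d2, d1))
X = rng.normal(size=(n, d1))
Z = X @ W_true.T  # noiseless "translations" generated by a known linear map

W = fit_translation_matrix(X, Z)
print(np.allclose(W, W_true))
```

With more word pairs than source dimensions and noiseless targets, the fitted matrix should recover the map that generated the data; with real embeddings the fit is only approximate.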

## Resources

Tomas Mikolov, Quoc V Le, Ilya Sutskever. 2013. [Exploiting Similarities among Languages for Machine Translation](https://arxiv.org/pdf/1309.4168.pdf)

Georgiana Dinu, Angeliki Lazaridou and Marco Baroni. 2014. [Improving zero-shot learning by mitigating the hubness problem](https://arxiv.org/pdf/1412.6568.pdf)

This notebook shows how to find the translation matrix using fastText vectors and is inspired by the gensim translation matrix notebook:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb

In [1]:
import os

from gensim import utils
from gensim.models import translation_matrix
from gensim.models import KeyedVectors
from gensim.models.fasttext import FastText
import warnings
warnings.filterwarnings("ignore")

For this tutorial, we'll train our model using English -> Italian word pairs from the OPUS collection. This corpus contains 5,000 word pairs, each consisting of an English word and its corresponding Italian word.

Dataset download:

In [2]:
!wget -nc https://s3.amazonaws.com/arrival/dictionaries/OPUS_en_it_europarl_train_5K.txt

File 'OPUS_en_it_europarl_train_5K.txt' already there; not retrieving.



This tutorial uses 300-dimensional vectors of English words as the source and vectors of Italian words as the target. (These vectors were trained with the fastText toolkit using CBOW. The context window was set to 5 words on either side of the target, the sub-sampling option was set to 1e-05, and the probability of a target word was estimated with negative sampling, drawing 10 samples from the noise distribution.)

Download dataset:

[EN.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt](https://pan.baidu.com/s/1nv3bYel)

[IT.200K.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt](https://pan.baidu.com/s/1boP0P7D)

Although fastText can handle out-of-vocabulary (OOV) words, the translation matrix works only with `.vec` files, because it accesses the underlying vocabulary directly. If you have OOV words, the best course of action is probably to edit the `.vec` files so that they accommodate those words as well.

In [3]:
# Load the source language word vector
source_word_vec_file = "wiki.simple.vec"
source_word_vec = KeyedVectors.load_word2vec_format(source_word_vec_file, binary=False)

In [4]:
# Load the target language word vector
target_word_vec_file = "wiki.it.vec"
target_word_vec = KeyedVectors.load_word2vec_format(target_word_vec_file, binary=False)

In [5]:
train_file = "OPUS_en_it_europarl_train_5K.txt"

with utils.smart_open(train_file, "r") as f:
    word_pair = [tuple(utils.to_unicode(line).strip().split()) for line in f]
    word_pair = [wp for wp in word_pair if wp[0] in source_word_vec and wp[1] in target_word_vec]
print (word_pair[:10])

[('for', 'per'), ('that', 'che'), ('with', 'con'), ('are', 'are'), ('are', 'sono'), ('this', 'questa'), ('this', 'questo'), ('you', 'lei'), ('not', 'non'), ('which', 'che')]


Train the translation matrix

In [6]:
transmat = translation_matrix.TranslationMatrix(source_word_vec, target_word_vec, word_pair)
transmat.train(word_pair)
print ("the shape of translation matrix is: ", transmat.translation_matrix.shape)

the shape of translation matrix is:  (300, 300)


Prediction time: for any given new word, we can map it to the other language space by computing $z = Wx$. We then find the word whose representation is closest to $z$ in the target language space, using cosine similarity as the distance metric.
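
The prediction step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than gensim's code: `nearest_by_cosine` is a hypothetical helper, and the 2-dimensional vectors and the matrix `W` are made-up toy values chosen so the result is easy to check by hand.

```python
import numpy as np


def nearest_by_cosine(z, target_vectors, target_words, topn=1):
    """Return the topn target words whose vectors have the highest cosine similarity to z."""
    unit_targets = target_vectors / np.linalg.norm(target_vectors, axis=1, keepdims=True)
    unit_z = z / np.linalg.norm(z)
    similarities = unit_targets @ unit_z      # one cosine similarity per target word
    best = np.argsort(-similarities)[:topn]   # indices sorted by decreasing similarity
    return [target_words[i] for i in best]


# Toy target space (assumption: invented vectors, not real embeddings).
target_words = ["uno", "due", "tre"]
target_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

W = np.array([[0.0, -1.0], [1.0, 0.0]])  # toy translation matrix (a 90-degree rotation)
x = np.array([1.0, 0.1])                 # toy source word vector
z = W @ x                                # mapped into the target space

print(nearest_by_cosine(z, target_vectors, target_words, topn=2))  # ['due', 'tre']
```

Normalizing both sides first means a single matrix-vector product yields all cosine similarities at once.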

#### Part one:
Let's look at how some number words are translated. We use the English words one, two, three, four and five as a test.

In [7]:
# The pair is in the form of (English, Italian), we can see whether the translated word is correct
words = [("one", "uno"), ("two", "due"), ("three", "tre"), ("four", "quattro"), ("five", "cinque")]
source_word, target_word = zip(*words)
translated_word = transmat.translate(source_word, 5)

In [8]:
for k, v in translated_word.items():
    print ("word ", k, " and translated word", v)

word  one  and translated word ['み', 'anche', 'cui', 'tre', 'due']
word  two  and translated word ['due', 'tre', 'quattro', 'cinque', 'nove']
word  three  and translated word ['tre', 'quattro', 'cinque', 'due', 'sette']
word  four  and translated word ['quattro', 'tre', 'cinque', 'due', 'nove']
word  five  and translated word ['cinque', 'quattro', 'tre', 'dieci', 'nove']
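
The five results above can be condensed into a single score. The following is a minimal sketch of precision@k, a commonly used measure for this kind of check; `precision_at_k` is a hypothetical helper, not part of gensim, and the `translations` dictionary simply copies the output printed above.

```python
def precision_at_k(translations, gold_pairs, k):
    """Fraction of source words whose gold translation appears among the top-k candidates."""
    hits = sum(
        1 for source, target in gold_pairs
        if target in translations.get(source, [])[:k]
    )
    return hits / len(gold_pairs)


gold_pairs = [("one", "uno"), ("two", "due"), ("three", "tre"),
              ("four", "quattro"), ("five", "cinque")]

# Candidate lists copied from the notebook output for the number words.
translations = {
    "one": ["み", "anche", "cui", "tre", "due"],
    "two": ["due", "tre", "quattro", "cinque", "nove"],
    "three": ["tre", "quattro", "cinque", "due", "sette"],
    "four": ["quattro", "tre", "cinque", "due", "nove"],
    "five": ["cinque", "quattro", "tre", "dieci", "nove"],
}

print(precision_at_k(translations, gold_pairs, k=1))  # 0.8
print(precision_at_k(translations, gold_pairs, k=5))  # 0.8
```

Four of the five number words are translated correctly at rank 1; `one` is the only miss, and it stays a miss even within the top 5.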


#### Part two:
Let's look at how some fruit words are translated. We use the English words apple, orange, grape, banana and mango as a test.

In [9]:
words = [("apple", "mela"), ("orange", "arancione"), ("grape", "acino"), ("banana", "banana"), ("mango", "mango")]
source_word, target_word = zip(*words)
translated_word = transmat.translate(source_word, 5)
for k, v in translated_word.items():
    print ("word ", k, " and translated word", v)

word  apple  and translated word ['apple', 'microsoft®', 'microsoft', 'pinkberry', 'applesoft']
word  orange  and translated word ['arancio', 'arancione', 'colore', 'giallo/arancione', 'aranciato']
word  grape  and translated word ['ortaggio', 'albicocche', 'ortaggi', 'carciofolata', 'arance']
word  banana  and translated word ['arachidi', 'riso', 'frutta', 'manioca', 'cereali']
word  mango  and translated word ['cardamomo', 'granturco', 'anacardi', 'riso', 'manioca']


#### Part three:
Let's look at how some animal words are translated. We use the English words dog, pig, cat, fish and birds as a test.

In [10]:
words = [("dog", "cane"), ("pig", "maiale"), ("cat", "gatto"), ("fish", "pesce"), ("birds", "uccelli")]
source_word, target_word = zip(*words)
translated_word = transmat.translate(source_word, 5)
for k, v in translated_word.items():
    print ("word ", k, " and translated word", v)

word  dog  and translated word ['cane', 'cani', 'randagio', 'cagnolino', 'catdog']
word  pig  and translated word ['dognipelo', 'mangiasciutti', 'animaletto', 'maiale', 'trituratore']
word  cat  and translated word ['gatto', 'gattino', 'scimmietta', 'scodinzolante', 'cagnolino']
word  fish  and translated word ['gamberetti', 'pesce', 'gamberetto', 'gamberi', 'pesci']
word  birds  and translated word ['uccelli', 'insettivori', 'animali', 'rettili', 'mammiferi']


### The Creation Time for the Translation Matrix

To test the creation time, we extracted more word pairs from a dictionary built from Europarl ([Europarl, en-it](http://opus.lingfil.uu.se/)). We obtained about 20K word pairs and their corresponding word vectors. Alternatively, you can download them here: [word_dict.pkl](https://pan.baidu.com/s/1dF8HUX7)

In [11]:
# # Uncomment this to load the larger word-pair dictionary
# import pickle
# word_dict = "word_dict.pkl"
# with utils.smart_open(word_dict, "rb") as f:  # pickle files must be opened in binary mode
#     word_pair = pickle.load(f)
# print ("the length of word pair ", len(word_pair))

In [15]:
import time

test_case = 10
word_pair_length = len(word_pair)
step = int(word_pair_length / test_case)

duration = []
sizeofword = []

for idx in range(0, test_case):
    sub_pair = word_pair[:(idx + 1)*step]

    startTime = time.time()
    transmat = translation_matrix.TranslationMatrix(source_word_vec, target_word_vec, sub_pair)
    transmat.train(sub_pair)
    endTime = time.time()
    
    sizeofword.append(len(sub_pair))
    duration.append(endTime - startTime)

In [16]:
import plotly
from plotly.graph_objs import Scatter, Layout

plotly.offline.init_notebook_mode(connected=True)

plotly.offline.iplot({
    "data": [Scatter(x=sizeofword, y=duration)],
    "layout": Layout(title="time for creation"),
}, filename="tm_creation_time.html")

The plot is a two-dimensional chart whose horizontal axis is the number of word pairs used for training and whose vertical axis is the time needed to train a translation matrix (in seconds). As the number of word pairs increases, the training time grows approximately linearly.
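
One way to check the linear trend numerically, rather than by eye, is to fit a straight line to the measurements and look at the coefficient of determination. The sketch below is an assumption-laden illustration: `fit_line` is a hypothetical helper, and the sample points are synthetic values on an exact line, not timings from this notebook. In the notebook you would pass `sizeofword` and `duration` instead.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares the residual error of the line with the total variance of ys.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic points lying exactly on y = 2x + 1 (NOT measured timings).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept, r_squared = fit_line(xs, ys)
print(slope, intercept, r_squared)  # 2.0 1.0 1.0
```

An R^2 close to 1 on the real measurements would support the linear reading of the plot; a visibly lower value would suggest some other growth pattern or noisy timings.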

### Linear Relationship Between Languages

To better understand the principle behind the method, we visualize the word vectors using PCA. The plots show that the vector representations of similar words in different languages are related by a linear transformation.
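
For readers unfamiliar with PCA, the 2-D projection used in the plots can be sketched with a singular value decomposition. This is a simplified stand-in for scikit-learn's `PCA(n_components=2).fit_transform`, whose output may differ in the sign of each axis; `pca_2d` is a hypothetical helper and the random vectors are placeholders for real embeddings.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    data = np.asarray(vectors, dtype=float)
    centered = data - data.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Placeholder data (assumption: random vectors instead of five 300-d word embeddings).
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 300))

projected = pca_2d(toy_vectors)
print(projected.shape)  # (5, 2)
```

Note that the cells in this section fit PCA separately for each language, so the English and Italian points do not share a common pair of axes. The comparison is therefore about the relative arrangement of the points within each language, not about absolute positions.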

In [17]:
from sklearn.decomposition import PCA

import plotly
from plotly.graph_objs import Scatter, Layout, Figure
plotly.offline.init_notebook_mode(connected=True)

In [18]:
words = [("one", "uno"), ("two", "due"), ("three", "tre"), ("four", "quattro"), ("five", "cinque")]
en_words_vec = [source_word_vec[item[0]] for item in words]
it_words_vec = [target_word_vec[item[1]] for item in words]

en_words, it_words = zip(*words)

pca = PCA(n_components=2)
new_en_words_vec = pca.fit_transform(en_words_vec)
new_it_words_vec = pca.fit_transform(it_words_vec)

In [19]:
# You can also use the plotly library to draw both languages in one figure
trace1 = Scatter(
    x = new_en_words_vec[:, 0],
    y = new_en_words_vec[:, 1],
    mode = 'markers+text',
    text = en_words,
    textposition = 'top'
)
trace2 = Scatter(
    x = new_it_words_vec[:, 0],
    y = new_it_words_vec[:, 1],
    mode = 'markers+text',
    text = it_words,
    textposition = 'top'
)
layout = Layout(
    showlegend = False
)
data = [trace1, trace2]

fig = Figure(data=data, layout=layout)
plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_number.html')

The figure shows that the word vectors for the English numbers one to five and the corresponding Italian words uno to cinque have similar geometric arrangements, so the relationship between the vector spaces that represent these two languages can be captured by a linear mapping.
If we know the translations of one to four from English to Italian, we can learn a transformation matrix that helps us translate five or other numbers into Italian.

In [20]:
words = [("one", "uno"), ("two", "due"), ("three", "tre"), ("four", "quattro"), ("five", "cinque")]
en_words, it_words = zip(*words)
en_words_vec = [source_word_vec[item[0]] for item in words]
it_words_vec = [target_word_vec[item[1]] for item in words]

# Translate the English word five to Italian word
translated_word = transmat.translate([en_words[4]], 3)
print ("translation of five: ", translated_word)

# the translated words of five
for item in translated_word[en_words[4]]:
    it_words_vec.append(target_word_vec[item])

pca = PCA(n_components=2)
new_en_words_vec = pca.fit_transform(en_words_vec)
new_it_words_vec = pca.fit_transform(it_words_vec)

translation of five:  OrderedDict([('five', ['cinque', 'quattro', 'tre'])])


In [21]:
trace1 = Scatter(
    x = new_en_words_vec[:, 0],
    y = new_en_words_vec[:, 1],
    mode = 'markers+text',
    text = en_words,
    textposition = 'top'
)
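trace2 = Scatter(
    # Plot only the five reference words; the three translation candidates are drawn as annotations
    x = new_it_words_vec[:5, 0],
    y = new_it_words_vec[:5, 1],
    mode = 'markers+text',
    text = it_words[:5],
    textposition = 'top'
)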
layout = Layout(
    showlegend = False,
    annotations = [dict(
        x = new_it_words_vec[5][0],
        y = new_it_words_vec[5][1],
        text = translated_word[en_words[4]][0],
        arrowcolor = "black",
        arrowsize = 1.5,
        arrowwidth = 1,
        arrowhead = 0.5
      ), dict(
        x = new_it_words_vec[6][0],
        y = new_it_words_vec[6][1],
        text = translated_word[en_words[4]][1],
        arrowcolor = "black",
        arrowsize = 1.5,
        arrowwidth = 1,
        arrowhead = 0.5
      ), dict(
        x = new_it_words_vec[7][0],
        y = new_it_words_vec[7][1],
        text = translated_word[en_words[4]][2],
        arrowcolor = "black",
        arrowsize = 1.5,
        arrowwidth = 1,
        arrowhead = 0.5
      )]
)
data = [trace1, trace2]

fig = Figure(data=data, layout=layout)
plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_numbers.html')

You should see nodes in two different colors, one for English and the other for Italian. For the translation of the word `five`, we return the `top 3` most similar words, `['cinque', 'quattro', 'tre']`. The top candidate, `cinque`, is the correct translation, so the result is convincing.

Let's look at some animal words. The figure shows that most of these words also share similar geometric arrangements.

In [22]:
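words = [("dog", "cane"), ("pig", "maiale"), ("cat", "gatto"), ("horse", "cavallo"), ("birds", "uccelli")]
en_words_vec = [source_word_vec[item[0]] for item in words]
it_words_vec = [target_word_vec[item[1]] for item in words]
en_words, it_words = zip(*words)

# Re-fit PCA on the animal vectors; otherwise the next cell would plot the stale number projections
pca = PCA(n_components=2)
new_en_words_vec = pca.fit_transform(en_words_vec)
new_it_words_vec = pca.fit_transform(it_words_vec)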

In [23]:
trace1 = Scatter(
    x = new_en_words_vec[:, 0],
    y = new_en_words_vec[:, 1],
    mode = 'markers+text',
    text = en_words,
    textposition = 'top'
)
trace2 = Scatter(
    x = new_it_words_vec[:, 0],
    y = new_it_words_vec[:, 1],
    mode = 'markers+text',
    text = it_words,
    textposition ='top'
)
layout = Layout(
    showlegend = False
)
data = [trace1, trace2]

fig = Figure(data=data, layout=layout)
plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_animal.html')

In [24]:
words = [("dog", "cane"), ("pig", "maiale"), ("cat", "gatto"), ("horse", "cavallo"), ("birds", "uccelli")]
en_words, it_words = zip(*words)
en_words_vec = [source_word_vec[item[0]] for item in words]
it_words_vec = [target_word_vec[item[1]] for item in words]

# Translate the English word birds to Italian word
translated_word = transmat.translate([en_words[4]], 3)
print ("translation of birds: ", translated_word)

# the translated words of birds
for item in translated_word[en_words[4]]:
    it_words_vec.append(target_word_vec[item])

pca = PCA(n_components=2)
new_en_words_vec = pca.fit_transform(en_words_vec)
new_it_words_vec = pca.fit_transform(it_words_vec)

translation of birds:  OrderedDict([('birds', ['uccelli', 'insettivori', 'animali'])])


In [25]:
trace1 = Scatter(
    x = new_en_words_vec[:, 0],
    y = new_en_words_vec[:, 1],
    mode = 'markers+text',
    text = en_words,
    textposition = 'top'
)
trace2 = Scatter(
    x = new_it_words_vec[:5, 0],
    y = new_it_words_vec[:5, 1],
    mode = 'markers+text',
    text = it_words[:5],
    textposition = 'top'
)
layout = Layout(
    showlegend = False,
    annotations = [dict(
        x = new_it_words_vec[5][0],
        y = new_it_words_vec[5][1],
        text = translated_word[en_words[4]][0],
        arrowcolor = "black",
        arrowsize = 1.5,
        arrowwidth = 1,
        arrowhead = 0.5
      ), dict(
        x = new_it_words_vec[6][0],
        y = new_it_words_vec[6][1],
        text = translated_word[en_words[4]][1],
        arrowcolor = "black",
        arrowsize = 1.5,
        arrowwidth = 1,
        arrowhead = 0.5
      ), dict(
        x = new_it_words_vec[7][0],
        y = new_it_words_vec[7][1],
        text = translated_word[en_words[4]][2],
        arrowcolor = "black",
        arrowsize = 1.5,
        arrowwidth = 1,
        arrowhead = 0.5
      )]
)
data = [trace1, trace2]

fig = Figure(data=data, layout=layout)
plot_url = plotly.offline.iplot(fig, filename='relatie_position_for_animal.html')

You should see nodes in two different colors, one for English and the other for Italian. For the translation of the word `birds`, we return the `top 3` most similar words, `['uccelli', 'insettivori', 'animali']`. As with the numbers, the top candidate is the correct translation, so the translation of animal words is also convincing.