# So, show me how to align two vector spaces for myself!

No problem. We're going to run through the same procedure as the example given in the README, and show you how to learn your own transformation, here aligning the English vector space to the Mi'kmaq vector space.

First, let's define a few simple functions...

In [106]:
import numpy as np
from fasttext import FastVector

# from https://stackoverflow.com/questions/21030391/how-to-normalize-array-numpy
def normalized(a, axis=-1, order=2):
    """Utility function to normalize the rows of a numpy array."""
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2==0] = 1
    return a / np.expand_dims(l2, axis)

def make_training_matrices(source_dictionary, target_dictionary, bilingual_dictionary):
    """
    Source and target dictionaries are the FastVector objects of
    source/target languages. bilingual_dictionary is a list of 
    translation pair tuples [(source_word, target_word), ...].
    """
    source_matrix = []
    target_matrix = []

    for (source, target) in bilingual_dictionary:
        # keep only the pairs for which both words have a vector
        if source in source_dictionary and target in target_dictionary:
            source_matrix.append(source_dictionary[source])
            target_matrix.append(target_dictionary[target])

    # return training matrices
    return np.array(source_matrix), np.array(target_matrix)

def learn_transformation(source_matrix, target_matrix, normalize_vectors=True):
    """
    Source and target matrices are numpy arrays, shape
    (dictionary_length, embedding_dimension). These contain paired
    word vectors from the bilingual dictionary.
    """
    # optionally normalize the training vectors
    if normalize_vectors:
        source_matrix = normalized(source_matrix)
        target_matrix = normalized(target_matrix)
        # debug output: the embedding dimension (number of columns)
        print(len(source_matrix.transpose()))

    # perform the SVD of the product of the paired matrices
    product = np.matmul(source_matrix.transpose(), target_matrix)
    # note: np.linalg.svd returns the third factor already transposed
    U, s, Vt = np.linalg.svd(product)

    # return orthogonal transformation which aligns source language to the target
    return np.matmul(U, Vt)
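
As a sanity check on the SVD step, here is a minimal sketch on synthetic data. The function name learn_orthogonal_map and the toy matrices are made up for illustration; it repeats the same computation as learn_transformation without the normalization, and checks that a known rotation is recovered when the target is an exact rotation of the source.

```python
import numpy as np

def learn_orthogonal_map(source_matrix, target_matrix):
    """Same SVD step as learn_transformation, without normalization."""
    product = np.matmul(source_matrix.transpose(), target_matrix)
    U, s, Vt = np.linalg.svd(product)
    return np.matmul(U, Vt)

rng = np.random.RandomState(0)

# build a random orthogonal matrix from a QR decomposition (toy ground truth)
true_rotation, _ = np.linalg.qr(rng.randn(5, 5))

# 50 toy "word vectors" of dimension 5, and their rotated counterparts
source = rng.randn(50, 5)
target = np.matmul(source, true_rotation)

learned = learn_orthogonal_map(source, target)

# the learned map should be orthogonal and match the rotation we applied
is_orthogonal = np.allclose(np.matmul(learned, learned.T), np.eye(5))
recovers_rotation = np.allclose(learned, true_rotation)
```

Real word vectors are never an exact rotation of each other, so in practice the learned map is only the best orthogonal fit to the dictionary pairs.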

Now we load the English and Mi'kmaq word vectors, and evaluate the similarity of "one" and "newt":

In [24]:
en_dictionary = FastVector(vector_file='/home/apatra/fastText/fastText_multilingual-master/eng.vec')
mi_dictionary = FastVector(vector_file='/home/apatra/fastText/fastText_multilingual-master/mic.vec')

en_vector = en_dictionary["one"]
mi_vector = mi_dictionary["newt"]
print(FastVector.cosine_similarity(en_vector, mi_vector))

reading word vectors from /home/apatra/fastText/fastText_multilingual-master/eng.vec
reading word vectors from /home/apatra/fastText/fastText_multilingual-master/mic.vec
-0.05159327789053213
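
To help read that number, here is a small sketch of cosine similarity in plain numpy. It assumes FastVector.cosine_similarity uses the standard definition (dot product divided by the product of the norms); the helper below is a stand-in written for illustration, not the library's code.

```python
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Standard cosine similarity: dot product over the product of norms."""
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

# vectors pointing the same way, at right angles, and in opposite directions
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))
```

The score ranges from -1 to 1, so a value near zero means the two vectors are close to unrelated.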


"one" and "newt" are a translation pair, so they should be highly similar; a cosine similarity this close to zero shows that the two word vector spaces are not yet aligned. To align them, we need a bilingual dictionary of English and Mi'kmaq translation pairs. As it happens, this is a great opportunity to show you something truly amazing...

Many words appear in the vocabularies of more than one language; words like "paris", "china" and "saskatchewan". These words usually mean similar things in each language. Therefore we can form a bilingual dictionary by simply extracting every word that appears in both the English and Mi'kmaq vocabularies.

In [22]:
print(list(mi_dictionary.word2id.keys()))

['aklasiewaloqsultiji', 'esma\xe2\x80\x99titl', 'gelulatl', 'etlitum', "gi'lewei", 'ta\xe2\x80\x99ntelmilamuksit', 'litaqnwikas\xe2\x80\x99kl', 'mimajuinuigtug', 'atkitemi', 'mikwite\xe2\x80\x99tmu', 'kaqimaliaptiksip', 'maqamikekkiskuk', 'kiwto\xe2\x80\x99qa\xe2\x80\x99sikewe\xe2\x80\x99l', 'elakutultijik', "apgwa'tuan", 'jijgluewjig', 'iknemat', 'chiasso', 'pilue\xe2\x80\x99', "pma'toq", 'nemiatijel', "wels'tua'tigul", "we'wasitew", 'wtuisunmual', 'iknemaj', 'wuji', "wejgwa'lawoqo", 'teke\xe2\x80\x99k', "nasgua'tiji", 'telkisi', 'mawelkisnl', 'mawelkisni', "a'tugwaqa", 'tetpoqpilaqn', 'china', "pilato'q", 'westawu\xe2\x80\x99lkw', 'pajjoqe\xe2\x80\x99kemk', 'kidi', 'kisilukewknew', 'kisikuitl', "ukwita'q", 'apu\xe2\x80\x99ksinew', 'seskutaq', "la'lukete", 'wije\xe2\x80\x99wmi\xe2\x80\x99ti', "ta'n", 'battiste', "un'jan", "gji'jiaqat", 'nutaptukl', "a'so'q", '(rattle)', 'ntawa\xe2\x80\x99qa\xe2\x80\x99taqatijik', 'pemi______________elkomiktule', 'nutankuat', 'weleywioqipetlin\xe2\x80\

In [7]:
mi_words = set(mi_dictionary.word2id.keys())
en_words = set(en_dictionary.word2id.keys())
overlap = list(mi_words & en_words)
bilingual_dictionary = [(entry, entry) for entry in overlap]

In [38]:
print(bilingual_dictionary)

[('writings', 'writings'), ('andre', 'andre'), ('negl', 'negl'), ('captain', 'captain'), ('paris', 'paris'), ('aul', 'aul'), ('edward', 'edward'), ('wji', 'wji'), ('saskatchewan', 'saskatchewan'), ('phial', 'phial'), ('jack', 'jack'), ('school', 'school'), ('whitman', 'whitman'), ('wooden', 'wooden'), ('herbert', 'herbert'), ('burke', 'burke'), ('enta', 'enta'), ('mius', 'mius'), ('nige', 'nige'), ('horu', 'horu'), ('lavery', 'lavery'), ('galal', 'galal'), ('asim', 'asim'), ('wuji', 'wuji'), ('asik', 'asik'), ('nek', 'nek'), ('neh', 'neh'), ('neg', 'neg'), ('kuo', 'kuo'), ('new', 'new'), ('net', 'net'), ('nes', 'nes'), ('nep', 'nep'), ('patu', 'patu'), ('jemu', 'jemu'), ('atel', 'atel'), ('men', 'men'), ('suga', 'suga'), ('mej', 'mej'), ('dore', 'dore'), ('china', 'china'), ('mes', 'mes'), ('celebration', 'celebration'), ('mima', 'mima'), ('k', 'k'), ('robin', 'robin'), ('digita', 'digita'), ('julia', 'julia'), ('lynne', 'lynne'), ('brought', 'brought'), ('koju', 'koju'), ('cocaine', '

In [67]:
import codecs

# read translation pairs from a file; each line is expected to hold
# one quoted English entry and one quoted Mi'kmaq entry, separated by ', '
bilingual_dictionary = []
with codecs.open('/home/apatra/fastText/fastText_multilingual-master/eng-mic', 'r', 'utf-8') as f:
    for line in f:
        eng, mic = line.split(', ')
        # drop the surrounding quotes and the trailing newline
        eng = eng.strip('"')
        mic = mic.replace('\n', '').replace('"', '')
        bilingual_dictionary.append((eng, mic))
print(bilingual_dictionary)

[(u'txt', u'txt'), (u'all', u'ms\u02bct'), (u'choose', u'megnatl'), (u'choose', u'megng'), (u'German', u'alman'), (u'good', u'amiglu\u02bcsit'), (u'good', u'amiglu\u02bclg'), (u'good', u'gelu\u02bclg'), (u'good', u'gelu\u02bcsit'), (u'goodbye', u'atiu'), (u'I', u'nin'), (u'May', u'Sqoljuigu\u2019s'), (u'May', u'Sqoljuigu\u02bcs'), (u'Micmac', u'Mi\u2019gmaq'), (u'Micmac', u'Mi\u2019gmawi\u2019simg'), (u'Micmac', u'Mi\u02bcgmaq'), (u'Micmac', u'Mi\u02bcgmaw'), (u'Micmac', u'm\xedkmaq'), (u'Micmac', u'Mi\u02bckmaq'), (u'Micmac', u'm\xedgmaq'), (u'Mi\u2019kmaq', u'M\xedkmaw\xedsimk'), (u'Mi\u2019kmaq', u'Mi\u02bcgmaq'), (u'Mi\u2019kmaq', u'm\xedkmaq'), (u'Mi\u2019kmaq', u'Mi\u02bckmaq'), (u'Mi\u2019kmaq', u'm\xedgmaq'), (u'Mohawk', u'gwatej'), (u'Newfoundland', u'Taqamkuk'), (u'search', u'gwilg'), (u'search', u'gwiluasit'), (u'stop', u'enqa\u02bclatl'), (u'stop', u'enqa\u02bcs\u02bcg'), (u'translation', u'nesutmalsewu\u02bcti'), (u'elderly woman', u'gisigui\u02bcsgw'), (u'is that so', u't
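
Not every pair read from the file can be used for training, because make_training_matrices silently skips pairs where either word has no vector. The sketch below shows one way to see which pairs are kept and which are dropped. The helper split_by_coverage and the toy vocabularies are made up for illustration; the three pairs are taken from the output above.

```python
def split_by_coverage(pairs, source_vocab, target_vocab):
    """Split translation pairs into those usable for training and those skipped."""
    usable, skipped = [], []
    for source, target in pairs:
        if source in source_vocab and target in target_vocab:
            usable.append((source, target))
        else:
            skipped.append((source, target))
    return usable, skipped

# three pairs from the dictionary file, with made-up toy vocabularies
toy_pairs = [("goodbye", "atiu"), ("German", "alman"), ("search", "gwilg")]
toy_source_vocab = {"goodbye", "search", "german"}
toy_target_vocab = {"atiu", "alman"}

usable, skipped = split_by_coverage(toy_pairs, toy_source_vocab, toy_target_vocab)
```

In this toy setup "German" is skipped only because of its capital letter, and "search" because its translation is missing from the target vocabulary. Whether case mismatches matter for the real vector files is an assumption worth checking against their vocabularies.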

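Let's align the English vectors to the Mi'kmaq vectors. Note that the cell above overwrote bilingual_dictionary, so the training pairs now come from the eng-mic file rather than from the "free" identical-strings dictionary that we acquired without any bilingual expert knowledge.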

In [107]:
# form the training matrices
source_matrix, target_matrix = make_training_matrices(
    en_dictionary, mi_dictionary, bilingual_dictionary)
# number of dictionary pairs found in both vocabularies
print("%d %d" % (len(source_matrix), len(target_matrix)))

# spot-check one coordinate of a paired source/target vector
print("%s %s" % (source_matrix[60][9], target_matrix[60][9]))

# learn and apply the transformation
transform = learn_transformation(source_matrix, target_matrix)
en_dictionary.apply_transform(transform)

289 289
-0.1425611234868448 -0.35762
300
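
What does applying the transform do to the English vectors? The sketch below assumes that apply_transform right-multiplies the embedding matrix by the transform (an assumption about FastVector, worth confirming in fasttext.py). The matrices are random toy data. Because the transform is orthogonal, vector lengths and similarities within the source language are left unchanged; only the orientation of the whole space moves.

```python
import numpy as np

rng = np.random.RandomState(1)

# toy embedding matrix: 4 "words" with 3 dimensions each
embeddings = rng.randn(4, 3)

# toy orthogonal matrix standing in for the learned transform
transform, _ = np.linalg.qr(rng.randn(3, 3))

# assumed behaviour of apply_transform: rotate every row of the matrix
aligned = np.matmul(embeddings, transform)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# lengths and within-language similarities survive the rotation
norms_preserved = np.allclose(
    np.linalg.norm(embeddings, axis=1), np.linalg.norm(aligned, axis=1))
similarity_preserved = np.isclose(
    cosine(embeddings[0], embeddings[1]), cosine(aligned[0], aligned[1]))
```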


Finally, we re-evaluate the similarity of "one" and "newt":

In [89]:
en_vector = en_dictionary["one"]
mi_vector = mi_dictionary["newt"]
print(FastVector.cosine_similarity(en_vector, mi_vector))

0.5346879727300464


"one" and "newt" are pretty similar after all :)
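
Once the spaces are aligned, a natural next step is to look up the closest target word for a source vector. Here is a minimal sketch of that lookup in plain numpy; the function nearest_neighbour and the toy words and vectors are hypothetical and do not use the FastVector API.

```python
import numpy as np

def nearest_neighbour(query_vector, target_matrix, target_words):
    """Return the target word whose vector has the highest cosine similarity to the query."""
    target_norms = np.linalg.norm(target_matrix, axis=1)
    scores = np.matmul(target_matrix, query_vector) / (
        target_norms * np.linalg.norm(query_vector))
    return target_words[int(np.argmax(scores))]

# toy target vocabulary with one 2-dimensional vector per word
toy_words = ["a", "b", "c"]
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# the query points mostly along the second axis, so "b" is the closest word
best = nearest_neighbour(np.array([0.1, 0.9]), toy_matrix, toy_words)
```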

Use this simple "identical strings" trick to align other language pairs for yourself, or prepare your own expert bilingual dictionaries for optimal performance.