# Cross-lingual embeddings

In this lab we will explore multilingual word embeddings and build a very rudimentary translation system.

Adapted from: https://github.com/facebookresearch/MUSE/blob/master/demo.ipynb

## Theory
Cross-lingual embedding vectors can be trained in a supervised or an unsupervised way.

**Supervised.**  

First, embeddings are trained for each language separately.  
Then the two embedding spaces are aligned by solving an optimization problem over a seed lexicon (a small number of aligned word pairs). Instead of the usual square loss, Joulin et al. minimize a relaxed cross-domain similarity local scaling (CSLS) criterion, so that training optimizes the same criterion that is later used for retrieval.  
More details in [Joulin et al. (2018)](https://arxiv.org/pdf/1804.07745.pdf).

**Unsupervised.**

First, a rotation matrix $W$ that roughly aligns the two distributions is learned using adversarial training.  
Second, the mapping $W$ is refined: frequent words aligned by the previous step are used as anchor points, and an energy function corresponding to a spring system between the anchor points is minimized (the Procrustes method).  
Finally, words are translated using the mapping $W$ and a distance metric (CSLS) that expands the space in regions with a high density of points (like the area around the word “cat”), so that “hubs” (like the word “cat”) become less close to other word vectors than they otherwise would be.
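
The Procrustes refinement step has a closed-form solution. The following is a minimal sketch on synthetic data, not the MUSE implementation; the function name `procrustes_align` and the toy matrices are assumptions made for illustration. With anchor pairs stored as rows of `X` (source) and `Y` (target), the orthogonal $W$ minimizing $\|XW^\top - Y\|_F$ is $UV^\top$, where $U \Sigma V^\top$ is the SVD of $Y^\top X$.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal matrix W minimizing ||X @ W.T - Y||_F.

    X, Y: arrays of shape (n_pairs, dim); row i of X should map to row i of Y.
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Toy check: hide a known rotation Q in the data and try to recover it.
rng = np.random.default_rng(0)
dim, n_pairs = 5, 50
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
X = rng.normal(size=(n_pairs, dim))
Y = X @ Q.T                                       # y_i = Q x_i, no noise

W = procrustes_align(X, Y)
print(np.allclose(W, Q))                  # does W match the hidden rotation?
print(np.allclose(W @ W.T, np.eye(dim)))  # is W orthogonal?
```

On real embeddings the anchor pairs are noisy, so $W$ only approximately maps source vectors onto their translations.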

![](https://drive.google.com/uc?export=view&id=1IuI4NGiUMUtS5whr_mldnsL5tFVoKtwH)


More details in [Conneau et al (2018)](https://arxiv.org/pdf/1710.04087.pdf)
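
The CSLS score described above can be written as $\mathrm{CSLS}(Wx, y) = 2\cos(Wx, y) - r_T(Wx) - r_S(y)$, where $r_T(Wx)$ is the mean cosine similarity of $Wx$ to its $K$ nearest target neighbors and $r_S(y)$ is the same quantity for $y$ among the mapped source vectors. Below is a small NumPy sketch of that formula; the function name `csls_scores`, the default `k`, and the toy vectors are assumptions for illustration, and a dense score matrix like this would be too large for full 100,000-word vocabularies.

```python
import numpy as np

def csls_scores(mapped_src, tgt, k=10):
    """CSLS score between every mapped source vector and every target vector."""
    # Normalize rows so that dot products are cosine similarities.
    src_n = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src_n @ tgt_n.T                                # (n_src, n_tgt)

    # r_T: mean similarity of each source vector to its k nearest targets.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # r_S: mean similarity of each target vector to its k nearest sources.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)

    # Vectors in dense neighborhoods (hubs) have large r values and are penalized.
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Two already-aligned toy vectors per language.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = csls_scores(src, tgt, k=1)
print(scores)
print(scores.argmax(axis=1))  # index of the best target for each source vector
```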

## Practice

### Load embeddings
Here we load a subset of 100,000 + 100,000 aligned [fastText embeddings](https://fasttext.cc/docs/en/aligned-vectors.html) for English and Russian.

In [0]:
!wget 'https://drive.google.com/uc?export=download&id=1-Hrc2uz14kmcsKYle7_penmpR7t0TtZR' -O en_embeddings.npz
!wget 'https://drive.google.com/uc?export=download&id=1-ZnGxODZypEnz5E0ssXLCEMG-fSfU28W' -O en_word2id.p

!wget 'https://drive.google.com/uc?export=download&id=1-OE9Tw8M5jWvM-4WRKfladaQzLIgoxiT' -O ru_embeddings.npz
!wget 'https://drive.google.com/uc?export=download&id=1-Y42yEnIsrQtVdQ7PABrvrn4eYRKmyuG' -O ru_word2id.p

!wget 'https://drive.google.com/uc?export=download&id=1-TsynEry2jdbIY2P3_c7UjHdbwr345Nf' -O secret_embeddings.npy

In [0]:
import numpy as np
import pickle
import pandas as pd

In [0]:
# load the embedding matrices (take the first array stored in each .npz archive)
en_npz = np.load('en_embeddings.npz', allow_pickle=True)
en_embeddings = en_npz[en_npz.files[0]]

ru_npz = np.load('ru_embeddings.npz', allow_pickle=True)
ru_embeddings = ru_npz[ru_npz.files[0]]

with open('en_word2id.p', 'rb') as handle:
    en_word2id = pickle.load(handle)

with open('ru_word2id.p', 'rb') as handle:
    ru_word2id = pickle.load(handle)

# create id2word for both languages
en_id2word = [None] * len(en_word2id)
for word, idx in en_word2id.items():
    en_id2word[idx] = word
ru_id2word = [None] * len(ru_word2id)
for word, idx in ru_word2id.items():
    ru_id2word[idx] = word

### Visualize multilingual embeddings

Let's visualize the embeddings. We take pairs of words with the same meaning, one English and one Russian. Because the vectors live in a 300-dimensional space, which is hard to picture, we project them to 2D using the first two PCA components.

In [0]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True)
pca.fit(np.vstack([en_embeddings, ru_embeddings]))
print('Variance explained: %.2f' % pca.explained_variance_ratio_.sum())

In [0]:
import matplotlib.pyplot as plt

def plot_similar_word(src_words, src_word2id, src_emb, tgt_words, tgt_word2id, tgt_emb, pca):

    Y = []
    word_labels = []
    for sw in src_words:
        Y.append(src_emb[src_word2id[sw]])
        word_labels.append(sw)
    for tw in tgt_words:
        Y.append(tgt_emb[tgt_word2id[tw]])
        word_labels.append(tw)

    # find PCA coords for 2 dimensions
    Y = pca.transform(Y)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]

    # display scatter plot
    plt.figure(figsize=(10, 8), dpi=80)
    plt.scatter(x_coords, y_coords, marker='x')

    for k, (label, x, y) in enumerate(zip(word_labels, x_coords, y_coords)):
        color = 'blue' if k < len(src_words) else 'red'  # src words in blue / tgt words in red
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points', fontsize=19,
                     color=color, weight='bold')
        
    for k in range(len(src_words)):
        idx_src = k
        idx_tgt = k + len(src_words)
        plt.plot([x_coords[idx_src], x_coords[idx_tgt]], [y_coords[idx_src], y_coords[idx_tgt]])

    plt.xlim(x_coords.min() - 0.2, x_coords.max() + 0.2)
    plt.ylim(y_coords.min() - 0.2, y_coords.max() + 0.2)
    plt.title('Visualization of the multilingual word embedding space')

    plt.show()

In [0]:
# six hand-picked English words and their Russian translations
en_words = ['university', 'love', 'history', 'tennis', 'research', 'conference']
ru_words = ['университет', 'любовь', 'история', 'теннис', 'исследование', 'конференция']

# assert words in dictionaries
for en_word in en_words:
    assert en_word in en_word2id, '"%s" not in source dictionary' % en_word
for ru_word in ru_words:
    assert ru_word in ru_word2id, '"%s" not in target dictionary' % ru_word

plot_similar_word(en_words, en_word2id, en_embeddings, ru_words, ru_word2id, ru_embeddings, pca)
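
The plot gives a qualitative picture. A complementary numeric check is the cosine similarity between the two vectors of each translation pair. The sketch below uses made-up 3-dimensional vectors rather than the real embeddings, and `pairwise_cosine` is a hypothetical helper, not part of the lab code.

```python
import numpy as np

def pairwise_cosine(A, B):
    """Cosine similarity between row i of A and row i of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return (A * B).sum(axis=1)

# Made-up vectors standing in for three English/Russian pairs.
src_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]])
tgt_vecs = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 3.0]])

sims = pairwise_cosine(src_vecs, tgt_vecs)
print(np.round(sims, 3))  # one similarity per pair; 1.0 means same direction
```

With the lab's variables, the inputs would presumably be `en_embeddings[[en_word2id[w] for w in en_words]]` and the Russian counterpart, assuming the embeddings are plain NumPy float arrays.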

### Get nearest neighbors for a given word

Let's write a function that, given a word, returns its K nearest neighbors in the vector space, ranked by dot product.

In [0]:
def get_nn(word, src_emb, src_word2id, tgt_emb, tgt_id2word, K=5):
    # 1. Look up the word embedding
    word_emb = src_emb[src_word2id[word]]

    # 2. Compute the scores for each word
    scores = tgt_emb.dot(word_emb)

    # 3. Find the index of the top K best scoring words
    # 4. Get the corresponding top K words
    k_best = pd.Series(scores, index=tgt_id2word).sort_values(ascending=False)[:K]

    return k_best

def print_k_best_for_word(k_best, word):
    print("Nearest neighbors of \"%s\":" % word)
    # use a separate loop variable so the `word` argument is not shadowed
    for neighbor, score in k_best.items():
        print('%.4f - %s' % (score, neighbor))

    print()
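
`get_nn` scores candidates with a raw dot product, which equals cosine similarity only if the embedding rows have unit length. The sketch below is one possible variant under that caveat: it normalizes explicitly and uses `np.argpartition` so that only the top `k` scores are sorted rather than the full vocabulary. The name `top_k_cosine` and the toy vocabulary are assumptions for illustration.

```python
import numpy as np

def top_k_cosine(query, emb, id2word, k=5):
    """Return the k (word, cosine) pairs closest to `query`, best first."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    scores = emb_n @ q_n

    k = min(k, len(scores))
    # argpartition places the k largest scores first without a full sort ...
    top = np.argpartition(-scores, k - 1)[:k]
    # ... and only those k candidates are then sorted by score.
    top = top[np.argsort(-scores[top])]
    return [(id2word[i], float(scores[i])) for i in top]

# Toy vocabulary with 2-dimensional vectors.
toy_id2word = ['cat', 'dog', 'car']
toy_emb = np.array([[1.0, 0.1], [0.9, 0.3], [0.0, 5.0]])
result = top_k_cosine(np.array([1.0, 0.0]), toy_emb, toy_id2word, k=2)
print(result)
```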

In [0]:
en_words = ["algorithm", "language", "research"]
for word in en_words:
    k_best = get_nn(word, en_embeddings, en_word2id, en_embeddings, en_id2word, K=6)
    print_k_best_for_word(k_best, word)

In [0]:
en_words = ["hello", "world"]
for word in en_words:
    k_best = get_nn(word, en_embeddings, en_word2id, en_embeddings, en_id2word, K=3)
    print_k_best_for_word(k_best, word)

We can also search in Russian and find the closest words in English!

In [0]:
ru_words = ["привет", "мир"]
for word in ru_words:
    k_best = get_nn(word, ru_embeddings, ru_word2id, en_embeddings, en_id2word, K=3)
    print_k_best_for_word(k_best, word)

In [0]:
en_words = ["hello", "world"]
for word in en_words:
    k_best = get_nn(word, en_embeddings, en_word2id, ru_embeddings, ru_id2word, K=3)
    print_k_best_for_word(k_best, word)

### A simple word-to-word translation system

We can try to use aligned embeddings to build a very rudimentary translation system.

Here is a list of texts in Russian. We will convert each text to lowercase, remove special symbols, and split it into words.  
Then each word that exists in `ru_word2id` is translated to English (using the closest English word); any other word is left unchanged.

In [0]:
ru_texts = [
"""Игровое действие в американском футболе состоит из серии коротких по продолжительности отдельных схваток, за пределами которых мяч называют «мертвым» или не в игре. Во время схватки могут быть разыграны:
пасовая комбинация,
выносная комбинация,
пант ( удар по мячу ),
попытка взятия зачетной зоны
свободный удар (ввод мяча в игру – начальный удар)
Цель игры – набрать максимальное количество очков, занеся мяч в зачетную зону противника (тачдаун - touchdown) или забив его в ворота с поля (филд-гол – field goals). Побеждает команда, набравшая наибольшее количество очков.""",
"""Я вас любил: любовь ещё, быть может,
В душе моей угасла не совсем;
Но пусть она вас больше не тревожит;
Я не хочу печалить вас ничем.
Я вас любил безмолвно, безнадежно,
То робостью, то ревностью томим;
Я вас любил так искренно, так нежно,
Как дай вам Бог любимой быть другим""",
"""Сегодня мы говорим про слова и стоит обсудить, как делать такое сопоставление вектора слову.
Вернемся к предмету: вот у нас есть слова и есть компьютер, который должен с этими словами как-то работать. Вопрос — как компьютер будет работать со словами? Ведь компьютер не умеет читать, и вообще устроен сильно иначе, чем человек. Самая первая идея, приходящая в голову — просто закодировать слова цифрами по порядку следования в словаре.""",
"""Интернет-мем — информация в той или иной форме (медиаобъект, то есть объект, создаваемый электронными средствами коммуникации, фраза, концепция или занятие), как правило, остроумная и ироническая[2], спонтанно приобретающая популярность, распространяясь в Интернете разнообразными способами (посредством социальных сетей, форумов, блогов, мессенджеров и пр.). Обозначает также явление спонтанного распространения такой информации или фразы. 
Мемами могут считаться как слова, так и изображения. Иначе говоря, это любые высказывания, картинки, видео или звукоряд, которые имеют значение и устойчиво распространяются во Всемирной паутине.""",
]

In [0]:
import re

for ru_text in ru_texts:
    # lowercase and drop punctuation (keeps letters, digits, whitespace, hyphens)
    ru_text = re.sub(r'[^\w\s-]', '', ru_text.lower())
    translation = []

    for word in re.split(r'(\s+)', ru_text):
        # Your code goes here
        if word.isspace():
            candidate = word
        elif word in ru_word2id:
            k_best = get_nn(word, ru_embeddings, ru_word2id, en_embeddings, en_id2word, K=1)
            candidate = k_best.index[0]
        else:
            candidate = word
        
        translation.append(candidate)

    print("".join(translation))
    print("\n===\n")
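
An alternative to deleting punctuation before splitting is to tokenize the text into alternating word and non-word runs, translate only the word runs, and join everything back together, so the output keeps its punctuation. A minimal sketch, with a tiny hypothetical lookup table `toy_dict` standing in for the nearest-neighbor search and `translate_text` as an illustrative name:

```python
import re

def translate_text(text, translate_word):
    """Translate word tokens with `translate_word`; keep everything else as is."""
    out = []
    # \w+ matches runs of letters/digits (Cyrillic included); \W+ matches the rest.
    for token in re.findall(r'\w+|\W+', text):
        if re.match(r'\w', token):
            out.append(translate_word(token.lower()))
        else:
            out.append(token)  # whitespace and punctuation pass through
    return ''.join(out)

# Hypothetical two-word dictionary; unknown words are returned unchanged.
toy_dict = {'привет': 'hello', 'мир': 'world'}
translated = translate_text('Привет, мир!', lambda w: toy_dict.get(w, w))
print(translated)
```

One trade-off of this tokenization is that hyphenated words such as "как-то" are split into separate tokens around the hyphen.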

### Bonus: Find the missing words!

We secretly removed four English words from the English embedding data! However, we saved their embedding vectors. Can you recover the missing words?

*Hint: Find the nearest neighbors for each vector.*

*Hint: You can test whether your guesses are correct by checking if the word exists in the `en_word2id` dictionary.*

Once you have found them, you can DM me (Jason) on Campuswire with the missing words. Remember to include your NetID.

In [0]:
file_name = "secret_embeddings.npy"
secret_embeddings = np.load(file_name, allow_pickle=True)
secret_embeddings.shape

In [0]:
# Your code goes here
raise NotImplementedError