# Exercise 10: Multilingual Representation Learning

In this exercise, you will induce and explore two kinds of multilingual text representations: cross-lingual word embeddings and multilingual encoders.

You should complete the parts of the exercise that are marked as **TODO**.
A correctly completed **TODO** gives 2 bonus points. Partially correct answers give 1 bonus point.
Some **TODO**s are inside a comment in a code block: Here, you should complete the line of code.
Other **TODO**s are inside a text block: Here, you should write a few sentences to answer the question.

**Important:** Some students were under the impression that you have to complete a TODO in a _single_ line of code. That is not the case: you can use as many lines as you need.

**Submission deadline:** 31.01.2022, 23:59 Central European Time

**Instructions for submission:** After completing the exercise, save a copy of the notebook as exercise10_multilinguality_MATRIKELNUMMER.ipynb, where MATRIKELNUMMER is your student ID number. Then upload the notebook to Moodle (submission exercise sheet 10).

In order to understand the code, it can be helpful to experiment a bit during development, e.g., to print vectors, matrices, and tensors or their shapes. But please remove these changes before submitting the notebook. If we cannot run your notebook, or if a print statement is congesting stdout too much, then we cannot grade it. 

To make the most of this exercise, you should try to read and understand the entire code, not just the parts that contain a **TODO**. If you have questions, write them down for the exercise, which will happen in the week after the submission deadline.

**CUDA:** You can use a GPU for this exercise (on colab: Runtime -> Change Runtime Type -> GPU). This is not mandatory (nor particularly crucial for this exercise), but it may speed up the execution of the code. 

# Required libraries

When working with any fast-changing software library, you should be extra careful to fix the library versions when you begin your project, and not change versions while you're developing.

In [None]:
!pip install sentence-transformers==2.1.0
!pip install numpy==1.20.0 # other numpy versions most likely also ok
!pip install pandas==1.2.2
!pip install scikit-learn==1.0

## Data

This exercise requires various data files (monolingual embeddings, translation dictionaries, and training and evaluation data for cross-lingual transfer of sentiment classification), which have been zipped and need to be downloaded from:

https://tinyurl.com/2p9x4crt

Place the content of the archive into a directory named **data**, located in the same directory as this notebook, so that the code can access the files via a relative path such as "data/file_name".

### Multilingual word embeddings

We begin by exploring multilingual word embedding spaces. Our starting point is a set of monolingually trained (that is, mutually unaligned) word vectors for several languages: English, German, Italian, and Croatian.

Each monolingual embedding space is serialized in two files: a *pickled Python dictionary* (.vocab) that maps words to indices in the embedding matrix, and an embedding matrix (*a serialized 2D numpy array*, .vectors) that contains the actual embeddings. All embedding files should be in the **data** subdirectory, which should be in the same directory as this notebook. 
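The two-file layout described above can be tried out on toy data first. The following is a minimal sketch, not the loading code for the provided files: the file names and contents are made up, and it assumes the matrix was written with `numpy.save`. If the provided `.vectors` files were pickled instead, `pickle.load` would be the matching call.

```python
import os
import pickle
import tempfile

import numpy

# Toy stand-ins for a real vocabulary and embedding matrix (hypothetical data).
toy_vocab = {"dog": 0, "cat": 1, "house": 2}
toy_vectors = numpy.arange(12, dtype=numpy.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "toy.vocab")      # hypothetical file name
    vectors_path = os.path.join(tmp_dir, "toy.vectors")  # hypothetical file name

    # Serialize: the vocabulary with pickle, the matrix with numpy.save (assumption).
    with open(vocab_path, "wb") as f:
        pickle.dump(toy_vocab, f)
    with open(vectors_path, "wb") as f:
        numpy.save(f, toy_vectors)

    # Deserialize both files again.
    with open(vocab_path, "rb") as f:
        loaded_vocab = pickle.load(f)
    with open(vectors_path, "rb") as f:
        loaded_vectors = numpy.load(f)

# The vocabulary maps a word to the index of its row in the embedding matrix.
cat_vector = loaded_vectors[loaded_vocab["cat"]]
print(loaded_vectors.shape)
print(cat_vector)
```

The key point is the indirection: a word is first looked up in the vocabulary, and the resulting index selects a row of the matrix.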

In [None]:
# we will load pre-trained serialized monolingual vectors
import pickle
import numpy

# Load (that is, "unpickle") the vocabularies for all languages 
vocab_en = #TODO: load the English vocabulary 
vectors_en = #TODO: load the English embedding vectors (2D numpy array) 

# let's see how many words we have in the vocabulary
print(len(vocab_en))

# let's see the dimensions of the embedding matrix
print(vectors_en.shape)

In [None]:
# let's see what the vector of some word looks like
word = "dog"
print("Index of " + word + ": " + str(vocab_en[word]))

vector = #TODO: print the embedding vector of the above specified word 
#(currently "dog", but feel free to change to any other word)

print("Vector of " + word + ": ")
print(vector)


In [None]:
# let us now load vectors and vocabularies of a few other languages

# German
vocab_de = #TODO: load the German vocabulary 
vectors_de = #TODO: load the German embeddings (2D numpy array) 

# Italian
vocab_it = #TODO: load the Italian vocabulary 
vectors_it = #TODO: load the Italian embeddings (2D numpy array) 

# Croatian
vocab_hr = #TODO: load the Croatian vocabulary  
vectors_hr = #TODO: load the Croatian embeddings (2D numpy array) 

# let's see how many entries we have in vocabularies of languages
print("DE", len(vocab_de), vectors_de.shape)
print("IT", len(vocab_it), vectors_it.shape)
print("HR", len(vocab_hr), vectors_hr.shape)

# TODO: What is the dimensionality of the embeddings (same for all languages :)?

Are the vectors from individual monolingual embedding spaces **comparable**? **No, they are not**. Let's verify that by comparing vector similarities within a language and across languages.
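To build some intuition for why independently trained spaces are not comparable, here is a small sketch on synthetic data (the vectors are random, not real embeddings). It models "language B" as the same space as "language A" after an arbitrary rotation, which is only a simplifying assumption about how two monolingual spaces relate. Similarities within each space survive the rotation, but comparing a vector with its own rotated copy is no longer meaningful.

```python
import numpy

rng = numpy.random.default_rng(0)

def cosine_sim(vec1, vec2):
    return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))

# A toy "language A" space with two word vectors (hypothetical data).
space_a = rng.normal(size=(2, 50))

# "Language B" holds the same two words, but the whole space is rotated
# by a random orthogonal matrix (obtained from a QR decomposition).
rotation, _ = numpy.linalg.qr(rng.normal(size=(50, 50)))
space_b = space_a @ rotation

sim_within_a = cosine_sim(space_a[0], space_a[1])
sim_within_b = cosine_sim(space_b[0], space_b[1])
sim_across = cosine_sim(space_a[0], space_b[0])  # same "word", different spaces

print("within A:", sim_within_a)
print("within B:", sim_within_b)
print("across A/B:", sim_across)
```

The two within-space values agree, because an orthogonal map preserves angles, while the cross-space value for the "same word" is far from 1.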

In [None]:
# cosine similarity is a common measure of similarity in vector space in NLP
# we define a function that computes the cosine of the angle between two vectors
# cosine similarity is a dot-product between the vectors normalized by the Euclidean (L2) norm of each vector
def cosine_sim(vec1, vec2):
    norm1 = numpy.linalg.norm(vec1)
    norm2 = numpy.linalg.norm(vec2)
    return numpy.dot(vec1, vec2) / (norm1 * norm2)

In [None]:
# let's see some monolingual similarities

language = "de" # play with different languages, "en", "it", "hr"

# just a shortcut, so we don't have to change the variables with vectors/vocabularies, we merely change the "language" variable
vectors = vectors_en if language == "en" else vectors_de if language == "de" else vectors_it if language == "it" else vectors_hr
vocab = vocab_en if language == "en" else vocab_de if language == "de" else vocab_it if language == "it" else vocab_hr

word1 = "hund" # play with different words
word2 = "katze" # play with different words

vector1 = #TODO: get the vector of the first word (word1) 
vector2 = #TODO: get the vector of the second word (word2)  

sim = #TODO: compute the cosine similarity between the embedding vectors of the two words 
print("Similarity between " + word1 + " and " + word2 + ": " + str(sim))


In [None]:
# let's put all embeddings and vocabularies into one dictionary
# just for easy access
emb_dict = {"en" : (vocab_en, vectors_en), 
            "de" : (vocab_de, vectors_de), 
            "it" : (vocab_it, vectors_it), 
            "hr" : (vocab_hr, vectors_hr)}

# let's create a more general function for comparing similarities between words from any two languages
def word_similarity(lang1, word1, lang2, word2):
    vocab_1 = #TODO: get the vocabulary of the lang1
    vectors_1 = #TODO: get the embeddings of the lang1 
    
    vocab_2 = #TODO: get the vocabulary of the lang2
    vectors_2 = #TODO: get the embeddings of the lang2
    
    vector_word_1 = vectors_1[vocab_1[word1]]
    vector_word_2 = vectors_2[vocab_2[word2]]
    
    return #TODO: return cosine similarity of vector_word_1 and vector_word_2
    


In [None]:
sim = word_similarity("de", "katze", "en", "cat") #TODO: compute the similarity between the German word "katze" and English word "cat"
print(sim)

So the independently built monolingual word embedding spaces of different languages are **not semantically aligned**. We need to **align them**. 

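- We will do this by computing a **projection matrix** that rotates (and possibly reflects) one embedding space with respect to the other. The Procrustes solution used below yields an orthogonal matrix, so it does not include a translation.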

- How do we know what we need to align? We provide some number of word translation pairs! 
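The idea behind the alignment can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the reference solution: the "translation pairs" are random vectors, and the target space is constructed as an exact rotation of the source space. It uses the standard closed-form solution of the orthogonal Procrustes problem, namely the SVD of the product of the transposed source matrix and the target matrix.

```python
import numpy

rng = numpy.random.default_rng(0)

# Toy setup (hypothetical data): 100 "translation pairs" in a 20-dimensional space.
src_toy = rng.normal(size=(100, 20))
true_rotation, _ = numpy.linalg.qr(rng.normal(size=(20, 20)))
trg_toy = src_toy @ true_rotation  # target vectors are rotated source vectors

# Orthogonal Procrustes: W = U @ Vt, where U, S, Vt is the SVD of src^T @ trg.
# W minimizes the Frobenius norm of (src @ W - trg) over all orthogonal matrices.
U, _, Vt = numpy.linalg.svd(src_toy.T @ trg_toy)
W = U @ Vt

is_orthogonal = numpy.allclose(W.T @ W, numpy.eye(20))
recovers_rotation = numpy.allclose(W, true_rotation)
max_error = numpy.abs(src_toy @ W - trg_toy).max()

print(is_orthogonal, recovers_rotation, max_error)
```

With real embeddings the two spaces are not exact rotations of each other, so the projected source vectors only approximate their translations; the residual error is then a rough indicator of how well the spaces can be aligned.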

In [None]:
import codecs

word_pairs_de_en =  [(l.strip().split("\t")[0], l.strip().split("\t")[1]) for l in codecs.open("data/translations.5k.de-en.tsv", "r", encoding = 'utf8', errors = 'replace').readlines()]
word_pairs_it_en =  [(l.strip().split("\t")[1], l.strip().split("\t")[0]) for l in codecs.open("data/translations.5k.en-it.tsv", "r", encoding = 'utf8', errors = 'replace').readlines()]
word_pairs_hr_en =  [(l.strip().split("\t")[1], l.strip().split("\t")[0]) for l in codecs.open("data/translations.5k.en-hr.tsv", "r", encoding = 'utf8', errors = 'replace').readlines()]


In [None]:
print(word_pairs_de_en)

In [None]:
# let's now create the matrices of aligned vectors of word translations, 
# given monolingual embeddings and word translation pairs

def align_word_vectors(src_vecs, src_vocab, trg_vecs, trg_vocab, trans_pairs):
    src_matrix =  []
    trg_matrix =  []
    
    # for each pair of words in our translation pairs
    for src_word, trg_word in trans_pairs:
        # add the vector of the source language word to the source matrix
        src_matrix.append(src_vecs[src_vocab[src_word.lower()]])
        # add the vector of the corresponding (translation) target language word to the target matrix
        trg_matrix.append(trg_vecs[trg_vocab[trg_word.lower()]])
        
    # return the row-aligned matrices (at the same index are vectors of mutual translations)
    # from these matrices, we will compute the projection/alignment matrix using the Procrustes method
    return numpy.array(src_matrix), numpy.array(trg_matrix)



In [None]:
src_de_en, trg_de_en = #TODO: call the align_word_vectors to align embeddings for given translation pairs 
                       # between German (source) and English (target)  

print(src_de_en.shape)
print(trg_de_en.shape)


In [None]:
# let's learn a projection matrix, given the aligned matrices of vectors of word translations
# we will use the so-called Procrustes solution of the alignment problem (i.e., finding the optimal projection matrix)

def get_projection_procrustes(src_mat, trg_mat):
    product = #TODO: complete this line so the result is a correct projection matrix from source to target embedding space
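    # numpy.linalg.svd returns the third factor already transposed (V^T), so no extra transpose is needed
    U, S, Vt = numpy.linalg.svd(product)
    
    return numpy.matmul(U, Vt)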

In [None]:
proj_mat_de_en = #TODO: obtain the projection matrix between German and English using the 
                 # translation-aligned embedding matrices 

#TODO: what is the shape of the projection matrix? Write the code that shows it


In [None]:
# let's now project the vectors of all the German words to the English vector space
proj_vectors_de = numpy.matmul(vectors_de, proj_mat_de_en)

# let's replace the original German vectors with the projected ones in the embeddings dictionary of all languages
emb_dict["de"] = (vocab_de, proj_vectors_de)

In [None]:
# let's see now what the German-English similarities look like after projection of German embeddings to the English emb. space
word_similarity("de", "katze", "en", "cat")

In [None]:
# let's perform the mapping to the English embedding space for the other two languages as well

## for Italian

# aligning vectors of word translations
src_it_en, trg_it_en = #TODO: align embeddings for translation pairs, Italian-English 
proj_mat_it_en = #TODO: Compute the projection matrix from Italian to English embedding space
proj_vectors_it = #TODO: project the vectors of all Italian words to the English space, using the obtained projection matrix
emb_dict["it"] = #TODO: replace the original (unaligned) Italian vectors with the projected (aligned) ones

In [None]:
word_similarity("it", "gatto", "en", "cat")

In [None]:
## for Croatian

src_hr_en, trg_hr_en = #TODO: align embeddings for translation pairs, Croatian-English  
proj_mat_hr_en = #TODO: Compute the projection matrix from Croatian to English embedding space
proj_vectors_hr = #TODO: project the vectors of all Croatian words to the English space, using the obtained projection matrix
emb_dict["hr"] = #TODO: replace the original (unaligned) Croatian vectors with the projected (aligned) ones

In [None]:
print(word_similarity("hr", "mačka", "en", "cat"))

# note that words of all four languages are now embedded in the same (originally English) embedding space
# so we can semantically compare words between any two of these languages (not just X-EN)
print(word_similarity("hr", "pas", "de", "hund"))
print(word_similarity("it", "gatto", "hr", "mačka"))
print(word_similarity("de", "flughafen", "it", "aeroporto"))

### Pretrained multilingual encoders!

Pretrained multilingual transformers enable the comparison of meaning of longer units of text in different languages. To this end, we will use pretrained transformers specialized for encoding sentence-level semantics: SentenceTransformers (package sentence-transformers). More information: 

https://www.sbert.net/

https://arxiv.org/pdf/1908.10084.pdf

https://arxiv.org/pdf/2004.09813.pdf
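Once several sentences have been encoded, it is often convenient to compare all of them at once. The sketch below assumes that the encoder output is a 2D array with one row per sentence, and uses a tiny hand-made matrix as a stand-in for real sentence embeddings. The helper name `pairwise_cosine` is made up for this illustration.

```python
import numpy

def pairwise_cosine(embeddings):
    """Return the matrix of cosine similarities between all rows of `embeddings`."""
    norms = numpy.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return normalized @ normalized.T

# Hypothetical stand-in for encoder output: 3 "sentences", 4 dimensions each.
toy_embeddings = numpy.array([
    [1.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],  # same direction as row 0, different length
    [0.0, 3.0, 0.0, 0.0],  # orthogonal to rows 0 and 1
])

sims = pairwise_cosine(toy_embeddings)
print(sims)
```

Entry `[i, j]` of the result is the cosine similarity between sentence `i` and sentence `j`; vector length has no effect, only direction does.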


In [None]:
from sentence_transformers import SentenceTransformer

# loading the pretrained sentence encoder (concretely, the DistilmUSE model, distilled from multilingual USE)
sent_encoder = SentenceTransformer('distiluse-base-multilingual-cased-v2')

In [None]:
sent_en = "Hello World"
sent_de = "Hallo Welt"
sent_es = "Hola mundo"

sentence_embeddings = #TODO: encode (that is, obtain the embeddings for) the above three sentences using the loaded sent_encoder
print(sentence_embeddings.shape)

In [None]:
import numpy

sim_en_de = #TODO: Compute the cosine similarity between the embedding of the English sentence and the German sentence

sim_de_es = #TODO: Compute the cosine similarity between the embedding of the Spanish sentence and the German sentence

print(sim_en_de)
print(sim_de_es)

### Cross-lingual transfer for downstream NLP tasks

We will now see how to exploit a multilingual representation space (i.e., our multilingual sentence encoder) to train a model for a text classification task on annotated data in one language and then use that classification model to make predictions for texts from other languages.

**Task**: sentiment classification of Amazon reviews

**Annotated training data**: in English

In [None]:
# let's load the data using the *pandas* library
# we have two files: the training dataset and testing dataset in English

# importing Python's pandas library for data loading and manipulation
import pandas as pd

# Step #1: loading our annotated reviews
train_data = pd.read_csv("data/labeled_train.txt", delimiter = '\t') # in our file, the values are actually TAB-separated
eval_data = pd.read_csv("data/labeled_test.txt", delimiter = '\t')

# let's see what our data actually looks like
train_data




In [None]:
# task: predict the binary sentiment label (POS or NEG) from the encoding of the review text produced by the Sentence Transformer
# as the classifier, we will use a simple logistic regression model

# filtering just the text from the pandas dataframe
train_texts = list(train_data["content"])

# embedding all texts with the sentence encoder
train_embeddings = #TODO: get the embeddings of the reviews (train_texts) with the sent_encoder

# converting labels from "POS" and "NEG" into numeric labels (1 and 0, respectively) for classification
train_labels = train_data["label"].tolist()
train_labels = [(1 if tl == "POS" else 0) for tl in train_labels]


In [None]:
# Now that we have input (embeddings of train texts) and labels, we can use them to train a classifier
# To this end, we will use a simple logistic regression classifier from scikit-learn

# We import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# we now train ("fit") the logistic regression classifier by providing the training input (SBERT embeddings of our training set texts) and 
# corresponding sentiment labels (0 or 1)

classifier = LogisticRegression(C = 32, solver = 'lbfgs', max_iter = 1000)
classifier.fit(train_embeddings, train_labels)

# the result is a trained classifier, which we can examine more closely in the next steps and make predictions with
print(classifier)



In [None]:
# The classifier has now been trained. Let's predict the labels for the eval_set texts and see how accurate the classifier is
# We first convert the evaluation texts into embeddings with our sentence_transformer and labels to binary {0,1} labels

# filtering just the text from the pandas dataframe
eval_texts = list(eval_data["content"])
eval_embeddings = #TODO: get the embeddings of the evaluation reviews (eval_texts) with the sent_encoder 

# converting labels from "POS" and "NEG" into numeric labels (1 and 0, respectively) for classification
eval_labels = eval_data["label"].tolist()
eval_labels = [(1 if tl == "POS" else 0) for tl in eval_labels]

In [None]:
# now we call the classifier to predict and score (function "score") the predictions on the evaluation dataset
accuracy = classifier.score(eval_embeddings, eval_labels)
print("Classification accuracy: " + str(accuracy * 100) + "%")


In [None]:
# Let's see some predictions on individual texts

# we are generating predictions as probability distributions over the two classes, for each text
predict_probs = classifier.predict_proba(eval_embeddings)

# for a text at "index", we'll display the text itself and the class probability distribution produced by the LR classifier
# play by changing the "index"
index = 100
print(eval_texts[index])
print(predict_probs[index])



...and now, the **language transfer** (zero-shot)

**Zero-shot** means that we are able to make predictions for texts in the target language without seeing any training instances in that language (i.e., our classifier was trained on English reviews only). Because the input features for the classifier are generated by a multilingual text encoder, we can make predictions for any language that our Sentence Transformer (sent_encoder) can encode. For the model used here, that is more than 50 languages.
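The mechanism can be illustrated without any real encoder. The sketch below uses purely synthetic vectors and assumes an idealized shared space in which each sentiment class points in the same direction regardless of language, with the "target language" merely being noisier. Real multilingual encoders only approximate this, so the numbers say nothing about actual transfer performance.

```python
import numpy
from sklearn.linear_model import LogisticRegression

rng = numpy.random.default_rng(0)
dim = 16

# Hypothetical shared space: each sentiment class has one direction, whatever the language.
neg_center = -numpy.ones(dim)
pos_center = numpy.ones(dim)

def sample(center, n, noise):
    """Draw n synthetic 'sentence embeddings' scattered around a class center."""
    return center + noise * rng.normal(size=(n, dim))

# "Source language" training data (labels: 0 = NEG, 1 = POS).
X_train = numpy.vstack([sample(neg_center, 200, 1.0), sample(pos_center, 200, 1.0)])
y_train = numpy.array([0] * 200 + [1] * 200)

# "Target language" test data: same class directions, but noisier (imperfect alignment).
X_target = numpy.vstack([sample(neg_center, 100, 1.5), sample(pos_center, 100, 1.5)])
y_target = numpy.array([0] * 100 + [1] * 100)

toy_classifier = LogisticRegression(max_iter=1000)
toy_classifier.fit(X_train, y_train)

# The columns of predict_proba follow toy_classifier.classes_, here [0, 1].
target_accuracy = toy_classifier.score(X_target, y_target)
print(toy_classifier.classes_, target_accuracy)
```

The classifier never sees a "target language" example during training; it works there only to the extent that both languages share the same class geometry in the embedding space.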


In [None]:
# let's create a few reviews in other languages:

# "This is a pretty bad speaker, you can barely hear anything."
rev_de = "Das ist ein ziemlich schlechter Lautsprecher, man kann kaum etwas hören."

# "The keyboard is light, easy to use and actually pretty beautiful. Totally worth the money!"
rev_it = "La tastiera è leggera, facile da usare e in realtà piuttosto bella. Vale assolutamente i soldi!" 

# "This USB adapter is just average. Not bad, but I wouldn't pay that much for it again."
rev_hr = "Ovaj USB adapter je onako, prosječan. Nije loš, ali ne bih ponovno dao te novce za njega."

# "This is by far the best USB charger I have ever had. My phone is charged in less than an hour, so amazing!"
rev_zh = "这是迄今为止我所拥有的最好的USB充电器。我的手机在不到一小时内就充好电了，太神奇了!"

revs = [rev_de, rev_it, rev_hr, rev_zh]

In [None]:
# let's encode the reviews with the sentence_transformer 
rev_embeddings = #TODO: embed the reviews above with the sent_encoder

# now let's classify them and make predictions with our LR classifier
predict_probs = classifier.predict_proba(rev_embeddings)

# let's print each review and the predicted sentiment probabilities
for i in range(len(revs)):
    print(revs[i])
    print(predict_probs[i])
    pred_label = "NEG" if predict_probs[i][0] > predict_probs[i][1] else "POS"  
    print("Predicted label: " + pred_label)
    print()
    

In [None]:
# TODO: discuss the expected transfer performance for the above target languages. What could be the main factors
# that determine the success of transfer of a classifier based on the pretrained multilingual encoder from the source language
# (i.e., English) to a concrete target language?