# Multilingual Embeddings - 3. Transfer Learning

One of the most promising applications of multilingual word embeddings is cross-lingual transfer learning. Suppose you would like to train an NLP model in several languages, but you only have training data in one language. Collecting new training data for each of your target languages can be expensive, and translating every text you'd like to process is rather cumbersome. With cross-lingual word embeddings, however, you can transfer your model from one language to the other very efficiently. 

## Embeddings

We use two sets of embeddings, English and French, from the [MUSE library](https://github.com/facebookresearch/MUSE) by Facebook Research. MUSE provides supervised fastText Wikipedia word embeddings for 30 languages, aligned in a single vector space.

In [1]:
!wget https://s3.amazonaws.com/arrival/embeddings/wiki.multi.en.vec -O /tmp/wiki.multi.en.vec
!wget https://s3.amazonaws.com/arrival/embeddings/wiki.multi.fr.vec -O /tmp/wiki.multi.fr.vec

--2018-09-18 11:23:44--  https://s3.amazonaws.com/arrival/embeddings/wiki.multi.en.vec
Resolving s3.amazonaws.com... 52.216.22.53
Connecting to s3.amazonaws.com|52.216.22.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 628614720 (599M) [text/plain]
Saving to: '/tmp/wiki.multi.en.vec'


2018-09-18 11:24:57 (8.48 MB/s) - '/tmp/wiki.multi.en.vec' saved [628614720/628614720]

--2018-09-18 11:24:57--  https://s3.amazonaws.com/arrival/embeddings/wiki.multi.fr.vec
Resolving s3.amazonaws.com... 54.231.80.243
Connecting to s3.amazonaws.com|54.231.80.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 628737710 (600M) [text/plain]
Saving to: '/tmp/wiki.multi.fr.vec'


2018-09-18 11:26:16 (7.74 MB/s) - '/tmp/wiki.multi.fr.vec' saved [628737710/628737710]



In [2]:
import numpy as np

def load_vectors(embedding_file_path):
    print("Loading vectors from", embedding_file_path)
    embeddings = []
    word2id = {}
    with open(embedding_file_path, 'r', encoding='utf-8') as f:
        next(f)  # skip the header line ("<number of words> <dimension>")
        for line in f:
            # each remaining line is "<word> <v1> <v2> ... <vn>"
            word, emb = line.rstrip().split(' ', 1)
            emb = np.fromstring(emb, sep=' ')
            assert word not in word2id, 'word found twice'
            embeddings.append(emb)
            word2id[word] = len(word2id)

    embeddings = np.vstack(embeddings)
    return embeddings, word2id

embeddings_en, embedding_word2id_en = load_vectors("/tmp/wiki.multi.en.vec")
embeddings_fr, embedding_word2id_fr = load_vectors("/tmp/wiki.multi.fr.vec")

Loading vectors from /tmp/wiki.multi.en.vec
Loading vectors from /tmp/wiki.multi.fr.vec
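
Because the two embedding sets live in one vector space, a word and its translation should end up close together. The sketch below illustrates the idea with a cosine-similarity lookup. The helper `nearest_neighbors` and the tiny hand-made 3-dimensional vectors are assumptions made for illustration; they are not part of MUSE or of the data loaded above.

```python
import numpy as np

def nearest_neighbors(query_vec, target_embeddings, target_id2word, k=3):
    """Return the k target words with the highest cosine similarity to query_vec."""
    # Normalise both sides so that a dot product equals cosine similarity.
    target_norm = target_embeddings / np.linalg.norm(target_embeddings, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = target_norm @ query_norm
    best = np.argsort(-scores)[:k]
    return [(target_id2word[int(i)], float(scores[i])) for i in best]

# Toy vectors standing in for the real 300-dimensional ones (hypothetical values).
toy_en = {"cat": np.array([1.0, 0.1, 0.0]), "car": np.array([0.0, 1.0, 0.2])}
toy_fr_words = ["chat", "voiture", "maison"]
toy_fr = np.array([[0.9, 0.2, 0.0],
                   [0.1, 0.9, 0.3],
                   [0.0, 0.1, 1.0]])

neighbors = nearest_neighbors(toy_en["cat"], toy_fr, toy_fr_words, k=2)
print(neighbors)
```

With the real data, the same function could be called with `embeddings_en[embedding_word2id_en["cat"]]` as the query, `embeddings_fr` as the target matrix, and an inverted copy of `embedding_word2id_fr` as the id-to-word mapping.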


## Data Preprocessing 
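We train a sentiment analysis model with Keras on the IMDB movie review dataset. We keep roughly the 10,000 most frequent words, map all other words to an out-of-vocabulary index, and pad or truncate every review to 500 tokens.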

In [3]:
from keras.datasets import imdb
from keras.preprocessing import sequence

VOCABULARY_SIZE = 10000
INDEX_FROM = 3
START_INDEX = 1
OOV_INDEX = 2
EMBEDDING_DIM = 300
SEQ_LENGTH = 500

(X_train, y_train), (X_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=VOCABULARY_SIZE,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=START_INDEX,
                                                      oov_char=OOV_INDEX,
                                                      index_from=INDEX_FROM)

X_train = sequence.pad_sequences(X_train, maxlen=SEQ_LENGTH)
X_test = sequence.pad_sequences(X_test, maxlen=SEQ_LENGTH)

Using TensorFlow backend.
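
To make the padding step concrete, here is a small NumPy sketch of what `pad_sequences` does with its default settings, which pad and truncate at the start of each sequence. The function `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]             # keep only the last maxlen tokens
        padded[row, -len(kept):] = kept  # right-align, padding value on the left
    return padded

toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
padded = pad_pre(toy_reviews, maxlen=5)
print(padded)
```

Short reviews are filled with the `<PAD>` index 0 on the left, and long reviews lose their first tokens, so the end of each review is always preserved.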


## Pre-trained word embeddings


In [4]:
def create_embedding_matrix(target_word2id, embedding_word2id, embeddings, num_rows, num_columns):
    embedding_matrix = np.zeros((num_rows, num_columns))
    for word, i in target_word2id.items():
        if i >= num_rows:
            continue
        if word in embedding_word2id: 
            embedding_matrix[i] = embeddings[embedding_word2id[word]]
    return embedding_matrix
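
The behaviour of `create_embedding_matrix` is easiest to see on a toy example. The sketch below repeats the function so that it runs on its own; all `toy_*` names and values are made up for illustration. Words without a pre-trained vector keep an all-zero row, and words whose id falls outside the matrix are skipped.

```python
import numpy as np

def create_embedding_matrix(target_word2id, embedding_word2id, embeddings, num_rows, num_columns):
    embedding_matrix = np.zeros((num_rows, num_columns))
    for word, i in target_word2id.items():
        if i >= num_rows:
            continue  # id does not fit in the matrix
        if word in embedding_word2id:
            embedding_matrix[i] = embeddings[embedding_word2id[word]]
    return embedding_matrix

# Hypothetical pre-trained vectors for two words.
toy_embedding_word2id = {"good": 0, "bad": 1}
toy_embeddings = np.array([[0.5, 0.5], [-0.5, -0.5]])

# Hypothetical model vocabulary: "meh" has no vector, "rare" has an id that is too large.
toy_target_word2id = {"<PAD>": 0, "good": 1, "bad": 2, "meh": 3, "rare": 7}

toy_matrix = create_embedding_matrix(toy_target_word2id, toy_embedding_word2id,
                                     toy_embeddings, num_rows=4, num_columns=2)

# Words inside the matrix that ended up with a zero vector.
missing = [w for w, i in toy_target_word2id.items()
           if i < 4 and w not in toy_embedding_word2id]
print(toy_matrix)
print(missing)
```

Checking a list like `missing` on the real vocabulary is a cheap way to see how much of the model's input is actually covered by the pre-trained embeddings.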


In [5]:
word2id_en = imdb.get_word_index()
word2id_en = {k:(v+INDEX_FROM) for k,v in word2id_en.items()}
word2id_en["<PAD>"] = 0
word2id_en["<START>"] = START_INDEX
word2id_en["<UNK>"] = OOV_INDEX

embedding_matrix_en = create_embedding_matrix(word2id_en, embedding_word2id_en, 
                                              embeddings_en, VOCABULARY_SIZE+INDEX_FROM-1, EMBEDDING_DIM)
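
The offset of `INDEX_FROM` can be confusing: the raw IMDB word index starts at 1, but ids 0, 1 and 2 are reserved for padding, the start marker and unknown words. The following sketch decodes a sequence of ids back into words. The dictionary `toy_word_index` is a hypothetical stand-in for the output of `imdb.get_word_index()`.

```python
INDEX_FROM = 3

# Hypothetical stand-in for imdb.get_word_index(): word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every rank by INDEX_FROM and reserve the first ids for special tokens.
word2id = {word: rank + INDEX_FROM for word, rank in toy_word_index.items()}
word2id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
id2word = {i: word for word, i in word2id.items()}

def decode(ids):
    """Turn a list of ids back into a readable string."""
    return " ".join(id2word.get(i, "<UNK>") for i in ids)

decoded = decode([0, 0, 1, 4, 6, 5, 2])
print(decoded)
```

Decoding a few training examples this way is a quick sanity check that the vocabulary and the embedding matrix use the same ids.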

## Training our model


In [6]:
# numpy was already imported as np above
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, Embedding
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(VOCABULARY_SIZE+INDEX_FROM-1, EMBEDDING_DIM, input_length=SEQ_LENGTH, weights=[embedding_matrix_en], trainable=False))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 300)          3000600   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 32)           28832     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               2000250   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
=================================================================
Total params: 5,029,933
Trainable params: 2,029,333
Non-trainable params: 3,000,600
_________________________________________________________________

In [7]:
NUM_EPOCHS=20
BATCH_SIZE=64

earlystop = EarlyStopping(monitor='val_loss', patience=2)

model.fit(X_train, y_train, validation_data=(X_test, y_test), 
          epochs=NUM_EPOCHS, batch_size=BATCH_SIZE, callbacks=[earlystop])
# Final evaluation of the model
scores = model.evaluate(X_test, y_test)
print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 25000 samples, validate on 25000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Accuracy: 86.08%


In [8]:
test_texts = ["it was a great movie".split(), "it was a horrible movie".split()]
test_vectors = [[word2id_en.get(w, OOV_INDEX) for w in text] for text in test_texts]

padded_test_vectors = sequence.pad_sequences(test_vectors, maxlen=SEQ_LENGTH)

model.predict(padded_test_vectors)

array([[0.88016844],
       [0.15926579]], dtype=float32)

## Transferring our model

In [9]:
!pip3 install wordfreq



In [10]:
from wordfreq import top_n_list

print(top_n_list('en', 10))
print(top_n_list('fr', 10))

['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
['de', 'la', 'le', 'et', 'l', 'à', 'les', 'est', 'des', 'en']


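We'll construct our vocabulary from the most frequent French words (just under 10,000, so that every id fits in the embedding matrix), and use it to create a French embedding matrix.

In [11]:
# Ids start at INDEX_FROM and the matrix has VOCABULARY_SIZE+INDEX_FROM-1 rows,
# so we take VOCABULARY_SIZE-1 words to keep the highest id inside the matrix.
word2id_fr = {word: idx+INDEX_FROM for idx, word in enumerate(top_n_list('fr', VOCABULARY_SIZE-1))}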
word2id_fr["<PAD>"] = 0
word2id_fr["<START>"] = START_INDEX
word2id_fr["<UNK>"] = OOV_INDEX

In [12]:
embedding_matrix_fr = create_embedding_matrix(word2id_fr, embedding_word2id_fr, 
                                              embeddings_fr, VOCABULARY_SIZE+INDEX_FROM-1, EMBEDDING_DIM)

In [13]:
model.layers[0].set_weights([embedding_matrix_fr])
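
Why does swapping the embedding matrix work? The layers above the embedding only ever see vectors, never word ids, so they behave the same for any vocabulary whose vectors live in the same space. The sketch below shows this with a deliberately simplified model: a fixed linear layer with a sigmoid over the mean embedding. The function `score`, the weights and the 2-dimensional toy vectors are all assumptions made for illustration; the real model is the convolutional network defined earlier.

```python
import numpy as np

def score(token_ids, embedding_matrix, weights, bias):
    """Sigmoid of a linear layer applied to the mean embedding of the tokens."""
    mean_vec = embedding_matrix[token_ids].mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(mean_vec @ weights + bias)))

# Toy aligned spaces (hypothetical values). The ids differ between the languages;
# only the vectors need to be comparable.
toy_matrix_en = np.array([[0.0, 0.0], [1.0, 0.2], [-1.0, 0.1]])   # <PAD>, great, horrible
toy_matrix_fr = np.array([[0.0, 0.0], [-0.9, 0.2], [0.9, 0.3]])   # <PAD>, horrible, merveilleux

# "Trained" parameters, fixed once and reused for both languages.
weights, bias = np.array([2.0, 0.0]), 0.0

en_pos = score([1], toy_matrix_en, weights, bias)   # "great"
fr_pos = score([2], toy_matrix_fr, weights, bias)   # "merveilleux"
fr_neg = score([1], toy_matrix_fr, weights, bias)   # "horrible"
print(en_pos, fr_pos, fr_neg)
```

The French scores fall on the same side of 0.5 as their English counterparts even though the parameters never saw French input. How close they are depends entirely on how well the two embedding spaces are aligned.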

In [14]:
test_texts_fr = ["c était un film merveilleux".split(), "c était un film horrible".split()]
test_vectors_fr = [[word2id_fr.get(w, OOV_INDEX) for w in text] for text in test_texts_fr]

padded_test_vectors_fr = sequence.pad_sequences(test_vectors_fr, maxlen=SEQ_LENGTH)

model.predict(padded_test_vectors_fr)

array([[0.724368  ],
       [0.28478226]], dtype=float32)

## Conclusion

Multilingual word embeddings offer us a simple and efficient way of transferring a model from one language to the other. This comes in handy when you need to apply your model to several languages, but you only have training data in one of them. 

Needless to say, transferring a model like this is not without its problems. After all, two languages don't just have different vocabularies. They also differ in grammar, most visibly in word order. This is a problem for any model that is sensitive to word order: recurrent neural networks, but also convolutional neural networks like ours, whose filters operate on the n-grams in a text.
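
The word order problem can be illustrated with a toy sentence pair. In the sketch below, a French phrase is mapped word by word onto English and the resulting bigrams are compared with those of the English phrase. The dictionary `fr_to_en` is a hypothetical word-level alignment that stands in for what aligned embeddings provide.

```python
def ngrams(tokens, n=2):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical word-level alignment between the two toy phrases.
fr_to_en = {"une": "a", "voiture": "car", "rouge": "red"}

en_bigrams = ngrams("a red car".split())
fr_bigrams = ngrams([fr_to_en[w] for w in "une voiture rouge".split()])

shared = set(en_bigrams) & set(fr_bigrams)
print(en_bigrams)
print(fr_bigrams)
print(shared)
```

Every individual word has a perfect counterpart, yet the two phrases share no bigram, because French places the adjective after the noun. A convolutional filter that learned to respond to a particular English word sequence will not see that sequence in the French input.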

When the two languages are reasonably close in linguistic terms, you should be able to get decent performance from the transferred model. However, a model that has been trained on data in the target language itself will very likely do better.