# Translating French to English and back with PyTorch

In [1]:
%matplotlib inline
import re, pickle, collections, bcolz, numpy as np, keras, sklearn, math, operator

Using TensorFlow backend.


In [2]:
from gensim.models import word2vec
import gensim.models.keyedvectors
import torch, torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F



In [3]:
path='C:/Users/Gavrilov/My Projects/Translation_system_Neural_Net/fr-en-109-corpus/'
dpath = 'C:/Users/Gavrilov/My Projects/Translation_system_Neural_Net/translate/'

## Prepare corpus

The French-English parallel corpus can be downloaded from http://www.statmt.org/wmt10/training-giga-fren.tar. It was created by Chris Callison-Burch, who crawled millions of web pages and then used 'a set of simple heuristics to transform French URLs onto English URLs (i.e. replacing "fr" with "en" and about 40 other hand-written rules), and assume that these documents are translations of each other'.

In [4]:
fname=path+'giga-fren.release2.fixed'
en_fname = fname+'.en'
fr_fname = fname+'.fr'

To make this problem a little simpler so we can train our model more quickly, we'll just learn to translate questions that begin with 'Wh' (e.g. what, why, where, which). Here are the regexps that filter the sentences we want.

In [5]:
re_eq = re.compile(r'^(Wh[^?.!]+\?)') # English: must start with "Wh" and end with a question mark
re_fq = re.compile(r'^([^?.!]+\?)')   # French: can start with anything, but must end with a question mark
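
As a quick check of what these two patterns accept, the sketch below runs them on a few made-up strings. The sample sentences are illustrative only and are not taken from the corpus.

```python
import re

re_eq = re.compile(r'^(Wh[^?.!]+\?)')  # English filter from the cell above
re_fq = re.compile(r'^([^?.!]+\?)')    # French filter from the cell above

# Hypothetical sample inputs
samples_en = ["What is light ?", "How are you?", "Why not. Really?"]
en_hits = [bool(re_eq.search(s)) for s in samples_en]

# Only the first question is captured; anything after the '?' is dropped
fr_match = re_fq.search("Où sommes-nous? Ici.")
fr_text = fr_match.group()

print(en_hits, fr_text)
```

Note that a sentence containing a full stop or exclamation mark before the first question mark is rejected, because the character class excludes `.`, `!` and `?`.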

In [6]:
lines = ((re_eq.search(eq), re_fq.search(fq))
         for eq, fq in zip(open(en_fname, encoding='utf8'), open(fr_fname, encoding='utf8')))
# open() returns an iterator over the lines of a file;
# zipping the two files pairs each English line with the French line at the same position

In [7]:
qs = [(e.group(), f.group()) for e,f in lines if e and f]; len(qs)
# run both regexes over every line pair and keep only the pairs where both match

52331

In [8]:
qs[:6]

[('What is light ?', 'Qu’est-ce que la lumière?'),
 ('Who are we?', 'Où sommes-nous?'),
 ('Where did we come from?', "D'où venons-nous?"),
 ('What would we do without it?', 'Que ferions-nous sans elle ?'),
 ('What is the absolute location (latitude and longitude) of Badger, Newfoundland and Labrador?',
  'Quelle sont les coordonnées (latitude et longitude) de Badger, à Terre-Neuve-etLabrador?'),
 ('What is the major aboriginal group on Vancouver Island?',
  'Quel est le groupe autochtone principal sur l’île de Vancouver?')]

Because it takes a while to load the data, we save the results to make it easier to load in later.

In [9]:
pickle.dump(qs, open(dpath+'fr-en-qs.pkl', 'wb'))

In [10]:
qs = pickle.load(open(dpath+'fr-en-qs.pkl', 'rb'))

In [11]:
en_qs, fr_qs = zip(*qs)

Because we are translating at word level, we need to tokenize the text first. (Note that it is also possible to translate at character level, which doesn't require tokenizing.) There are many tokenizers available, but we found we got the best results using these simple heuristics.
NLTK, which ships with a number of tokenizers, would be an alternative.

In [12]:
re_apos = re.compile(r"(\w)'s\b")         # make 's a separate word
re_mw_punc = re.compile(r"(\w[’'])(\w)")  # other ' in a word creates 2 words
re_punc = re.compile("([\"().,;:/_?!—])") # add spaces around punctuation
re_mult_space = re.compile(r"  *")        # replace multiple spaces with just one
def simple_toks(sent):
    sent = re_apos.sub(r"\1 's", sent)
    sent = re_mw_punc.sub(r"\1 \2", sent)
    sent = re_punc.sub(r" \1 ", sent).replace('-', ' ')
    sent = re_mult_space.sub(' ', sent)
    return sent.lower().split()

In [13]:
fr_qtoks = list(map(simple_toks, fr_qs)); fr_qtoks[:4]

[['qu’', 'est', 'ce', 'que', 'la', 'lumière', '?'],
 ['où', 'sommes', 'nous', '?'],
 ["d'", 'où', 'venons', 'nous', '?'],
 ['que', 'ferions', 'nous', 'sans', 'elle', '?']]

In [14]:
en_qtoks = list(map(simple_toks, en_qs)); en_qtoks[:4]

[['what', 'is', 'light', '?'],
 ['who', 'are', 'we', '?'],
 ['where', 'did', 'we', 'come', 'from', '?'],
 ['what', 'would', 'we', 'do', 'without', 'it', '?']]

In [15]:
simple_toks("Rachel's baby is cuter than other's.")

['rachel', "'s", 'baby', 'is', 'cuter', 'than', 'other', "'s", '.']

Special tokens used to pad the end of sentences, and to mark the start of a sentence.

In [16]:
PAD = 0; SOS = 1

Enumerate the unique words (*vocab*) in the corpus, and also create the reverse map (word->index). Then use this mapping to encode every sentence as a list of int indices.

In [17]:
def toks2ids(sents):
    voc_cnt = collections.Counter(t for sent in sents for t in sent) # count how often each token occurs
    vocab = sorted(voc_cnt, key=voc_cnt.get, reverse=True) # vocabulary, most frequent token first
    vocab.insert(PAD, "<PAD>") # insert the padding token at index 0
    vocab.insert(SOS, "<SOS>") # insert the start-of-sentence token at index 1
    w2id = {w:i for i,w in enumerate(vocab)} # reverse mapping from word to id
    ids = [[w2id[t] for t in sent] for sent in sents] # encode every sentence as a list of ids
    return ids, vocab, w2id, voc_cnt

In [18]:
# toks2ids returns the list of ids for each sentence, the vocabulary, the word->id mapping, and the token counts
fr_ids, fr_vocab, fr_w2id, fr_counts = toks2ids(fr_qtoks)
en_ids, en_vocab, en_w2id, en_counts = toks2ids(en_qtoks)
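
To make the encoding concrete, here is a small sketch that runs the same function on a toy two-sentence corpus. The toy sentences are made up; the function body repeats the one defined above so that the block runs on its own.

```python
import collections

PAD, SOS = 0, 1

def toks2ids(sents):
    voc_cnt = collections.Counter(t for sent in sents for t in sent)
    vocab = sorted(voc_cnt, key=voc_cnt.get, reverse=True)
    vocab.insert(PAD, "<PAD>")
    vocab.insert(SOS, "<SOS>")
    w2id = {w: i for i, w in enumerate(vocab)}
    ids = [[w2id[t] for t in sent] for sent in sents]
    return ids, vocab, w2id, voc_cnt

# Hypothetical toy corpus
toy = [["what", "is", "light", "?"], ["what", "is", "it", "?"]]
ids, vocab, w2id, counts = toks2ids(toy)

print(vocab)  # special tokens first, then words by descending frequency
print(ids)
```

Because the two special tokens occupy indices 0 and 1, the most frequent real word always gets id 2.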

## Word vectors

Stanford's GloVe word vectors can be downloaded from https://nlp.stanford.edu/projects/glove/ (in the code below we preprocess them into a bcolz array). We use these because each individual word has a single word vector, which is what we need for translation. Pretrained word2vec models, on the other hand, often include multi-word phrases as vocabulary entries.

In [19]:
with open('C:/Users/Gavrilov/.keras/datasets/glove_6B/glove.6B.100d.txt', 'r', encoding="utf8") as f:
    lines = [line.split() for line in f] # split() with no arguments also drops the trailing '\n' of each line

In [22]:
glove_words = [elem[0] for elem in lines]
glove_words_idx = {elem:idx for idx,elem in enumerate(glove_words)} # same as setting glove_words_idx[elem] = idx in a loop
glove_vecs = np.stack([np.array(elem[1:], dtype=np.float32) for elem in lines]) # np.float32 is single-precision floating point
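
The parsing above relies on the GloVe text format: one word per line, followed by its vector components separated by spaces. The sketch below applies the same three steps to a tiny in-memory stand-in for the file; the two words and their 3-dimensional vectors are invented for illustration.

```python
import io
import numpy as np

# Hypothetical stand-in for a GloVe text file
fake_glove = io.StringIO("king 0.1 -0.2 0.3\nqueen 0.4 0.5 -0.6\n")

rows = [line.split() for line in fake_glove]          # [word, v1, v2, ...]
words = [r[0] for r in rows]                          # index -> word
word_idx = {w: i for i, w in enumerate(words)}        # word -> index
vecs = np.stack([np.array(r[1:], dtype=np.float32) for r in rows])

print(vecs.shape, vecs.dtype, word_idx["queen"])
```

Passing a list (rather than a generator) to `np.stack` avoids problems with newer NumPy versions, which expect a sequence of arrays.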

In [24]:
pickle.dump(glove_words, open('C:/Users/Gavrilov/.keras/datasets/glove_6B/glove.6B.100d.txt'+'_glove_words.pkl', 'wb'))
pickle.dump(glove_words_idx, open('C:/Users/Gavrilov/.keras/datasets/glove_6B/glove.6B.100d.txt'+'_glove_words_idx.pkl', 'wb'))

In [30]:
#saving array using specific function
def save_array(fname, arr):
    c=bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()
save_array('C:/Users/Gavrilov/.keras/datasets/glove_6B/glove.6B.100d.txt'+'_glove_vecs'+'.dat', glove_vecs)

In [34]:
def load_glove(loc):
    return (bcolz.open(loc+'_glove_vecs.dat')[:],
        pickle.load(open(loc+'_glove_words.pkl','rb'), encoding='utf8'),
        pickle.load(open(loc+'_glove_words_idx.pkl','rb'), encoding='utf8'))

In [35]:
en_vecs, en_wv_word, en_wv_idx = load_glove('C:/Users/Gavrilov/.keras/datasets/glove_6B/glove.6B.100d.txt')
en_w2v = {w: en_vecs[en_wv_idx[w]] for w in en_wv_word}
n_en_vec, dim_en_vec = en_vecs.shape

In [36]:
en_w2v['king']

array([-0.32306999, -0.87616003,  0.21977   ,  0.25268   ,  0.22976001,
        0.73879999, -0.37954   , -0.35306999, -0.84368998, -1.11129999,
       -0.30265999,  0.33177999, -0.25113001,  0.30447999, -0.077491  ,
       -0.89815003,  0.092496  , -1.14069998, -0.58323997,  0.66869003,
       -0.23122001, -0.95854998,  0.28262001, -0.078848  ,  0.75314999,
        0.26583999,  0.34220001, -0.33949   ,  0.95608002,  0.065641  ,
        0.45747   ,  0.39835   ,  0.57964998,  0.39267001, -0.21851   ,
        0.58794999, -0.55998999,  0.63367999, -0.043983  , -0.68730998,
       -0.37841001,  0.38025999,  0.61641002, -0.88269001, -0.12346   ,
       -0.37928   , -0.38317999,  0.23868001,  0.66850001, -0.43320999,
       -0.11065   ,  0.081723  ,  1.15690005,  0.78957999, -0.21223   ,
       -2.3211    , -0.67806   ,  0.44560999,  0.65706998,  0.1045    ,
        0.46217   ,  0.19912   ,  0.25802001,  0.057194  ,  0.53443003,
       -0.43133   , -0.34311   ,  0.59789002, -0.58416998,  0.06

We also need French word vectors. We use the ones published at http://fauconnier.github.io/index.html
(direct link: http://embeddings.org/frWac_non_lem_no_postag_no_phrase_200_skip_cut100.bin).

In [43]:
w2v_path='C:/Users/Gavrilov/.keras/datasets/frWac_non_lem_no_postag_no_phrase_200_skip_cut100.bin'
#fr_model = word2vec.Word2Vec.load_word2vec_format(w2v_path, binary=True)
fr_model = gensim.models.KeyedVectors.load_word2vec_format(w2v_path, binary=True)
fr_voc = fr_model.vocab
dim_fr_vec = 200

We need to map each word index in our vocabs to its word vector. Not every word in our vocabs will be in our word vectors, since our tokenization approach won't be identical to that of the word vector creators. In these cases we simply create a random vector.

In [44]:
def create_emb(w2v, targ_vocab, dim_vec):
    vocab_size = len(targ_vocab)
    emb = np.zeros((vocab_size, dim_vec)) # start from an array of zeros
    found=0 # number of vocab words that have a pretrained vector

    for i, word in enumerate(targ_vocab):
        try: emb[i] = w2v[word]; found+=1 # copy the pretrained vector if there is one
        except KeyError: emb[i] = np.random.normal(scale=0.6, size=(dim_vec,)) # otherwise use a random vector

    return emb, found
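
A small sketch of how this lookup behaves, using a hypothetical two-word "pretrained" dictionary and a five-entry vocabulary. The function is repeated here so that the block is self-contained; all toy names and values are assumptions for illustration.

```python
import numpy as np

def create_emb(w2v, targ_vocab, dim_vec):
    emb = np.zeros((len(targ_vocab), dim_vec))
    found = 0
    for i, word in enumerate(targ_vocab):
        try:
            emb[i] = w2v[word]
            found += 1
        except KeyError:
            # no pretrained vector: fall back to a random one
            emb[i] = np.random.normal(scale=0.6, size=(dim_vec,))
    return emb, found

# Hypothetical pretrained vectors and vocabulary
toy_w2v = {"what": np.ones(3), "is": np.full(3, 2.0)}
toy_vocab = ["<PAD>", "<SOS>", "what", "is", "zzz"]

emb, found = create_emb(toy_w2v, toy_vocab, 3)
print(emb.shape, found)
```

In this toy run the special tokens are not in the dictionary, so they receive random vectors just like any other unknown word.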

In [45]:
en_embs, found = create_emb(en_w2v, en_vocab, dim_en_vec); en_embs.shape, found

((19549, 100), 17251)

In [46]:
fr_embs, found = create_emb(fr_model, fr_vocab, dim_fr_vec); fr_embs.shape, found

((26709, 200), 21878)

## Prep data

Each sentence has to be of equal length. Keras has a convenient function `pad_sequences` to truncate and/or pad each sentence as required. Even though we're not using Keras for the neural net, we can still use any functions from it we need!

In [47]:
from keras.preprocessing.sequence import pad_sequences

In [48]:
maxlen = 30
en_padded = pad_sequences(en_ids, maxlen, dtype='int64', padding="post", truncating="post")
fr_padded = pad_sequences(fr_ids, maxlen, dtype='int64', padding="post", truncating="post")
en_padded.shape, fr_padded.shape, en_embs.shape

((52331, 30), (52331, 30), (19549, 100))
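
To illustrate what "post" padding and truncation do, here is a plain NumPy sketch of the same idea. `pad_post` is a hypothetical helper written for this illustration, not the Keras implementation.

```python
import numpy as np

def pad_post(seqs, maxlen, value=0):
    """Pad with `value` at the end, or cut off the end, so every row has length `maxlen`."""
    out = np.full((len(seqs), maxlen), value, dtype='int64')
    for i, s in enumerate(seqs):
        trunc = s[:maxlen]            # "post" truncation keeps the first maxlen ids
        out[i, :len(trunc)] = trunc   # "post" padding leaves the fill value at the end
    return out

padded = pad_post([[5, 6, 7], [8, 9, 10, 11, 12, 13]], maxlen=4)
print(padded)
```

The short sentence is filled up with the `PAD` id 0 on the right, and the long one loses its last tokens.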

In [156]:
n = int(len(en_ids)*0.9)
idxs = np.random.permutation(len(en_ids))
fr_train, fr_test = fr_padded[idxs][:n], fr_padded[idxs][n:]
en_train, en_test = en_padded[idxs][:n], en_padded[idxs][n:]
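
The split works because both arrays are indexed with the same permutation, so sentence pairs stay aligned. A minimal sketch with made-up arrays (the seed and toy data are assumptions for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)        # fixed seed so the sketch is repeatable
src = np.arange(10).reshape(10, 1)    # stand-in for en_padded
tgt = src * 100                       # stand-in for fr_padded, row i belongs to src row i

n = int(len(src) * 0.9)               # 90% of the rows go to training
idxs = rng.permutation(len(src))
src_train, src_test = src[idxs][:n], src[idxs][n:]
tgt_train, tgt_test = tgt[idxs][:n], tgt[idxs][n:]

print(src_train.shape, src_test.shape)
```

Since the permutation is random, a different split is produced on every run unless a seed is fixed.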

## Model

In [157]:
en_train.shape

(47097, 30)

In [159]:
from keras_tqdm import TQDMNotebookCallback

In [160]:
parms = {'verbose': 0, 'callbacks': [TQDMNotebookCallback()]}

In [161]:
fr_wgts = [fr_embs.T, np.zeros((len(fr_vocab),))]

In [165]:
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense, Activation, Input

In [166]:
inp = Input((maxlen,))
x = Embedding(len(en_vocab), dim_en_vec, input_length=maxlen,
              weights=[en_embs], trainable=False)(inp)
x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = LSTM(128, return_sequences=True)(x)
x = TimeDistributed(Dense(dim_fr_vec))(x)
x = TimeDistributed(Dense(len(fr_vocab), weights=fr_wgts))(x)
x = Activation('softmax')(x)
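
One way to read `fr_wgts`: the last `Dense` layer starts out with the transposed French embedding matrix as its kernel and a zero bias, so each output logit is initially the dot product between the 200-dimensional vector from the previous layer and one French word vector. The NumPy sketch below shows this on a toy scale; the sizes, the seed and the row normalisation are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_fr_embs = rng.normal(size=(5, 4))                        # 5 "words", 4 dimensions
toy_fr_embs /= np.linalg.norm(toy_fr_embs, axis=1, keepdims=True)  # unit-length rows

kernel, bias = toy_fr_embs.T, np.zeros(5)   # same layout as fr_wgts: (dim, vocab) and (vocab,)

h = toy_fr_embs[3]               # pretend the previous layer produced word 3's vector
logits = h @ kernel + bias       # what a Dense layer computes before the softmax

print(np.argmax(logits))
```

With unit-length rows the largest dot product is the one with the matching word, so the argmax recovers index 3. The real embeddings are not normalised, so this only illustrates the initial bias of the layer, not a guarantee.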

In [169]:
from keras.models import Model

In [176]:
model = Model(inp, x)
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])

In [173]:
import keras.backend as K

In [177]:
K.set_value(model.optimizer.lr, 1e-3)

In [178]:
hist=model.fit(en_train, np.expand_dims(fr_train,-1), batch_size=64, epochs=20, **parms, 
               validation_data=[en_test, np.expand_dims(fr_test,-1)])


KeyboardInterrupt: 
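
The calls to `np.expand_dims` in the `fit` call add a trailing axis of length 1 to the integer targets, so that they have the same number of dimensions as the model output (batch, time step, vocabulary). The sketch below only shows the shapes involved; the batch size of 8 is arbitrary.

```python
import numpy as np

toy_targets = np.zeros((8, 30), dtype='int64')   # 8 sentences, maxlen 30, one word id per step
expanded = np.expand_dims(toy_targets, -1)       # add a trailing axis

print(toy_targets.shape, expanded.shape)
```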

In [179]:
model.save_weights(dpath+'trans.h5')

## Testing

In [180]:
def sent2ids(sent):
    sent = simple_toks(sent)
    ids = [en_w2id[t] for t in sent]
    return pad_sequences([ids], maxlen, padding="post", truncating="post")

In [181]:
def en2fr(sent): 
    ids = sent2ids(sent)
    tr_ids = np.argmax(model.predict(ids), axis=-1)
    return ' '.join(fr_vocab[i] for i in tr_ids[0] if i>0)

In [182]:
en2fr("what is the size of canada?")

'quelles est les les de ? ?'
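
The decoding step in `en2fr` takes the most probable word at every position and drops padding. A toy sketch with a hand-made probability array (the vocabulary and probabilities are invented for illustration):

```python
import numpy as np

toy_vocab = ["<PAD>", "<SOS>", "quel", "est", "?"]

# Fake model output: 1 sentence, 4 time steps, 5 vocabulary entries
probs = np.zeros((1, 4, 5))
probs[0, 0, 2] = 1.0   # step 0 -> "quel"
probs[0, 1, 3] = 1.0   # step 1 -> "est"
probs[0, 2, 4] = 1.0   # step 2 -> "?"
probs[0, 3, 0] = 1.0   # step 3 -> padding

tr_ids = np.argmax(probs, axis=-1)   # best word id per time step
decoded = ' '.join(toy_vocab[i] for i in tr_ids[0] if i > 0)   # skip PAD (id 0)

print(decoded)
```

Each position is decoded independently of the others, which may be one reason the real output above repeats words such as "les" and "?".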

## Fit the model (just one epoch this time)

In [186]:
hist=model.fit(en_train, np.expand_dims(fr_train,-1), batch_size=64, epochs=1, **parms, 
               validation_data=[en_test, np.expand_dims(fr_test,-1)])

Exception in thread Thread-6:
Traceback (most recent call last):
  File "C:\Users\Gavrilov\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "C:\Users\Gavrilov\Anaconda3\lib\site-packages\tqdm\_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "C:\Users\Gavrilov\Anaconda3\lib\_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

In [187]:
model.save_weights(dpath+'trans.h5')

In [189]:
en2fr('what is the size of canada?')

'quel est ce que ? ? ?'