# Artificial Language Translation

Find all links and an extensive description here:

https://docs.google.com/document/d/1EUcLajYHXuMnziPqYMW88uAXuXOpAE9o88tgRgjWjbM/edit?ts=5bb1374e#

This notebook follows: 
https://github.com/fastai/fastai/blob/master/courses/dl2/translate.ipynb
http://course.fast.ai/lessons/lesson11.html

In [7]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

## Goal

This notebook takes common seq2seq translation models, which were built to translate between natural languages, and applies them to translating between computer languages, using SQL dialects as the example.

Find a more extensive discussion of motivation and use cases here: 
https://docs.google.com/document/d/1EUcLajYHXuMnziPqYMW88uAXuXOpAE9o88tgRgjWjbM/edit?ts=5bb1374e#

## Overview

In [None]:
We need 3 things:
    - Data: pairs of equivalent statements (SQL_MYSQL, SQL_MONGODB)
    - Architecture: 
        - RNN, possibly with attention (open question)
        - for neural translation the architecture matters
    - Loss function: 
        - says that this SQL_MYSQL should have generated that SQL_MONGODB
        - we compare the predictions against the target to see how good the model was
    

    
What is an RNN?
    - a fully connected network
    - previous layers can pass values on to subsequent layers
    - RNN outputs can be passed on from one network to another (what's the point?)

Architecture:
    - same idea as in a language model
    - we pass the SQL_MYSQL statement through an RNN
    - the result is one hidden state, the 'backbone' / 'encoder'. This is just a vector (a matrix once we have a mini-batch)
    - on top of this representation one could e.g. put a linear layer to get a sentiment
    - here, however, the goal is to get a sequence of tokens that represents the SQL_MONGODB statement
    - we have a similar goal in a language model, where we predict the next word
        - in a language model the number of output tokens equals the number of input tokens
        - in translation the output length can differ, and words are not guaranteed to be in the same position
    - GOAL: 
        - Similar to natural language translation: arbitrary-length output where the tokens in the input do not necessarily correspond to the tokens in the output
        - additionally for artificial languages: 
            - the output must be syntactically well-formed --> just pass it through a compiler
            - variable values should be immutable! 1 is 1, 'hello' is 'hello'
                --> anything which is not a keyword is a variable. How do we teach the network to translate them as immutable?
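The "pass it through a compiler" check can be sketched with the standard library. This is only an illustration under loud assumptions: it uses SQLite (via `sqlite3`) as a stand-in parser, which is neither MySQL nor MongoDB, and the function name `is_valid_sql` and the example schema are hypothetical.

```python
import sqlite3

def is_valid_sql(statement, schema):
    """Return True if SQLite can parse and plan `statement` against `schema`.

    Assumption: SQLite stands in for the real target dialect, so this only
    approximates a syntax check for other SQL dialects.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)            # create the tables the statement refers to
        conn.execute("EXPLAIN " + statement)  # parse and plan the statement without running it
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()

# Hypothetical schema and statements, for illustration only.
schema = "CREATE TABLE customers (id INTEGER, name TEXT);"
good = is_valid_sql("SELECT name FROM customers WHERE id = 1", schema)
bad = is_valid_sql("SELECT FROM WHERE customers", schema)
```

A check like this could be used to reject malformed model output after decoding; it says nothing about whether the translation is semantically right.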
        


High level we execute the following steps:


This is the same approach that e.g. Google Translate uses. The difference is that we build up our own word vectors / language model.

## 1. DATA

### Generating the Dataset

We use two standard DBs and their SQL dialects, with a slightly different data model, to build pairs of statements.

In [8]:
## https://www.w3schools.com/python/python_mysql_create_db.asp

## https://www.w3schools.com/python/python_mongodb_getstarted.asp

### Translation Files

In [None]:
Steps:
- Tokenize: for NLP this step is done with tokenizers like spaCy. 
        We use a list of keywords and parse the string to identify them.
        Then any string in the statement that is not a keyword is a variable and needs to be treated as immutable.
- Assign numbers to the keywords
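The two steps above can be sketched as follows. This is a minimal sketch, not the notebook's final tokenizer: the keyword list `SQL_KEYWORDS`, the regular expression, and the function name `tokenize` are all hypothetical, and a real keyword list would come from the dialect's documentation.

```python
import re

# Hypothetical, deliberately tiny keyword list.
SQL_KEYWORDS = ["SELECT", "FROM", "WHERE", "AND", "OR", "=", "*", ",", "(", ")"]

# Split into quoted strings, words/numbers, and single punctuation characters.
TOKEN_RE = re.compile(r"'[^']*'|\w+|[^\w\s]")

def tokenize(statement):
    """Return a list of (kind, text) pairs, where kind is 'kw' or 'var'."""
    tokens = []
    for text in TOKEN_RE.findall(statement):
        if text.upper() in SQL_KEYWORDS:
            tokens.append(("kw", text.upper()))  # keywords are normalised to upper case
        else:
            tokens.append(("var", text))         # everything else is kept verbatim
    return tokens

# Step 2: assign a number to every keyword.
KEYWORD_IDS = {kw: i for i, kw in enumerate(SQL_KEYWORDS)}

toks = tokenize("select name FROM customers WHERE city = 'Berlin'")
```

Variables are kept as raw text here; how to number them is a separate question, since they must survive translation unchanged.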

In [9]:
## check this: what does it do, and is it useful for our kind of text?
## spacy? no. Do we need this here?
from fastai.text import *

ModuleNotFoundError: No module named 'spacy'

In [None]:
## adjust this to 2 sql languages
PATH = Path('data/translate')
TMP_PATH = PATH/'tmp'
TMP_PATH.mkdir(exist_ok=True)
fname='giga-fren.release2.fixed'
en_fname = PATH/f'{fname}.en'
fr_fname = PATH/f'{fname}.fr'

In [None]:
## we will not need this; it only keeps question pairs to cut down the dataset
## raw strings so that \? is passed to the regex engine unchanged
re_eq = re.compile(r'^(Wh[^?.!]+\?)')
re_fq = re.compile(r'^([^?.!]+\?)')

lines = ((re_eq.search(eq), re_fq.search(fq)) 
         for eq, fq in zip(open(en_fname, encoding='utf-8'), open(fr_fname, encoding='utf-8')))

qs = [(e.group(), f.group()) for e,f in lines if e and f]

In [None]:
## just save this, so we do not need to repeat the cut-down exercise above
pickle.dump(qs, (PATH/'fr-en-qs.pkl').open('wb'))

In [None]:
qs = pickle.load((PATH/'fr-en-qs.pkl').open('rb'))

In [None]:
## just some examples
qs[:5], len(qs)

In [None]:
en_qs,fr_qs = zip(*qs)

Tokenizing Artificial Languages 
- could be as simple as having a list of keywords.
- any string in the text that is not a keyword is a variable and needs to be treated as immutable (how?)
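One possible answer to the "how?" is to never show variables to the network at all: replace them with numbered placeholders before translation and put the originals back afterwards. The sketch below assumes this placeholder scheme; `mask_variables`, `unmask_variables`, and the `_varN_` naming are hypothetical and not something the notebook implements.

```python
def mask_variables(tokens, keywords):
    """Replace each non-keyword token with a numbered placeholder.

    Returns the masked token list and a mapping placeholder -> original text.
    The same variable always gets the same placeholder.
    """
    mapping = {}   # placeholder -> original text
    seen = {}      # original text -> placeholder
    masked = []
    for tok in tokens:
        if tok in keywords:
            masked.append(tok)
        else:
            if tok not in seen:
                seen[tok] = f"_var{len(seen)}_"
                mapping[seen[tok]] = tok
            masked.append(seen[tok])
    return masked, mapping

def unmask_variables(tokens, mapping):
    """Put the original variable text back in place of the placeholders."""
    return [mapping.get(tok, tok) for tok in tokens]

keywords = {"SELECT", "FROM", "WHERE", "="}
tokens = ["SELECT", "name", "FROM", "customers", "WHERE", "name", "=", "'hello'"]
masked, mapping = mask_variables(tokens, keywords)
restored = unmask_variables(masked, mapping)
```

With this scheme the model only has to learn where placeholders go, and `'hello'` stays `'hello'` by construction.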

In [None]:
## change this, no spacy wrapper needed
en_tok = Tokenizer.proc_all_mp(partition_by_cores(en_qs))

In [None]:
## change this, no spacy wrapper needed
fr_tok = Tokenizer.proc_all_mp(partition_by_cores(fr_qs), 'fr')

In [None]:
en_tok[0], fr_tok[0]

In [None]:
np.percentile([len(o) for o in en_tok], 90), np.percentile([len(o) for o in fr_tok], 90)

In [None]:
keep = np.array([len(o)<30 for o in en_tok])

In [None]:
en_tok = np.array(en_tok)[keep]
fr_tok = np.array(fr_tok)[keep]

In [None]:
pickle.dump(en_tok, (PATH/'en_tok.pkl').open('wb'))
pickle.dump(fr_tok, (PATH/'fr_tok.pkl').open('wb'))

In [None]:
en_tok = pickle.load((PATH/'en_tok.pkl').open('rb'))
fr_tok = pickle.load((PATH/'fr_tok.pkl').open('rb'))

In [None]:
## this needs to change, we might need two dictionaries to identify variables that are immutable
## Also, we always have the same list of keywords, so we might do all of this differently (how?)
def toks2ids(tok,pre):
    freq = Counter(p for o in tok for p in o)
    itos = [o for o,c in freq.most_common(40000)]
    itos.insert(0, '_bos_')
    itos.insert(1, '_pad_')
    itos.insert(2, '_eos_')
    itos.insert(3, '_unk_')
    stoi = collections.defaultdict(lambda: 3, {v:k for k,v in enumerate(itos)})
    ids = np.array([([stoi[o] for o in p] + [2]) for p in tok])
    np.save(TMP_PATH/f'{pre}_ids.npy', ids)
    pickle.dump(itos, open(TMP_PATH/f'{pre}_itos.pkl', 'wb'))
    return ids,itos,stoi

In [None]:
## mapping from tokens to IDs
en_ids,en_itos,en_stoi = toks2ids(en_tok,'en')
fr_ids,fr_itos,fr_stoi = toks2ids(fr_tok,'fr')

In [None]:
def load_ids(pre):
    ids = np.load(TMP_PATH/f'{pre}_ids.npy')
    itos = pickle.load(open(TMP_PATH/f'{pre}_itos.pkl', 'rb'))
    stoi = collections.defaultdict(lambda: 3, {v:k for k,v in enumerate(itos)})
    return ids,itos,stoi

In [None]:
en_ids,en_itos,en_stoi = load_ids('en')
fr_ids,fr_itos,fr_stoi = load_ids('fr')

In [None]:
## just a test here
[fr_itos[o] for o in fr_ids[0]], len(en_itos), len(fr_itos)

Here we generate an artificial language pair from two SQL dialects: https://www.w3schools.com/python/python_mysql_create_db.asp


### Meaning through Context

#### Word Vectors

We build these up artificially; the source could be the documentation of the language.

Note: the cool way to do this would be to build a language model for MySQL - see below.

In [None]:
## build word vectors for artificial languages like this --> can we auto-produce them?
## fastText word vectors are available from https://fasttext.cc/docs/en/english-vectors.html

In [None]:
# ! pip install git+https://github.com/facebookresearch/fastText.git

In [None]:
import fastText as ft

In [None]:
## To use the fastText library, you'll need to download fasttext word vectors for your language (download the 'bin plus text' ones).

In [None]:
en_vecs = ft.load_model(str((PATH/'wiki.en.bin')))

In [None]:
fr_vecs = ft.load_model(str((PATH/'wiki.fr.bin')))

In [None]:
def get_vecs(lang, ft_vecs):
    vecd = {w:ft_vecs.get_word_vector(w) for w in ft_vecs.get_words()}
    pickle.dump(vecd, open(PATH/f'wiki.{lang}.pkl','wb'))
    return vecd

In [None]:
en_vecd = get_vecs('en', en_vecs)
fr_vecd = get_vecs('fr', fr_vecs)

In [None]:
en_vecd = pickle.load(open(PATH/'wiki.en.pkl','rb'))
fr_vecd = pickle.load(open(PATH/'wiki.fr.pkl','rb'))

In [None]:
ft_words = en_vecs.get_words(include_freq=True)
ft_word_dict = {k:v for k,v in zip(*ft_words)}
ft_words = sorted(ft_word_dict.keys(), key=lambda x: ft_word_dict[x])

len(ft_words)

In [None]:
dim_en_vec = len(en_vecd[','])
dim_fr_vec = len(fr_vecd[','])
dim_en_vec,dim_fr_vec

In [None]:
en_vecs = np.stack(list(en_vecd.values()))
## mean and standard deviation
en_vecs.mean(),en_vecs.std()

[TO DO]
Language Model

We build one up; the source could be log files or the documentation of the language. 
https://github.com/fastai/fastai/blob/master/courses/dl1/lang_model-arxiv.ipynb

Could we build pre-trained language models for artificial languages, e.g. for MySQL?
But is it useful? For artificial languages this is all much simpler (syntactic rules); you could even supply the syntactic rules that a compiler applies.


### Model Data



In [None]:
## purely to speed up processing: truncate the longest sequences (at the 99th / 97th percentile of length)
enlen_90 = int(np.percentile([len(o) for o in en_ids], 99))
frlen_90 = int(np.percentile([len(o) for o in fr_ids], 97))
enlen_90,frlen_90

In [None]:
en_ids_tr = np.array([o[:enlen_90] for o in en_ids])
fr_ids_tr = np.array([o[:frlen_90] for o in fr_ids])

In [None]:
## training set
## length & index for pytorch
## convention: v variables, t tensors, a arrays
class Seq2SeqDataset(Dataset):
    def __init__(self, x, y): self.x,self.y = x,y
    ## anything that is not yet a numpy array gets turned into it    
    def __getitem__(self, idx): return A(self.x[idx], self.y[idx]) 
    def __len__(self): return len(self.x)

In [None]:
## easy way to get training & validation set
np.random.seed(42)
trn_keep = np.random.rand(len(en_ids_tr))>0.1 ## random list of bools to index into the set
en_trn,fr_trn = en_ids_tr[trn_keep],fr_ids_tr[trn_keep]
en_val,fr_val = en_ids_tr[~trn_keep],fr_ids_tr[~trn_keep]
len(en_trn),len(en_val)

In [None]:
## for english to french, just switch around. This is the training & validation set here
trn_ds = Seq2SeqDataset(fr_trn,en_trn)
val_ds = Seq2SeqDataset(fr_val,en_val)

In [None]:
bs=125

In [None]:
##  validation set: sort by length; training set: randomize the order, but keep sequences of similar length roughly together
trn_samp = SortishSampler(en_trn, key=lambda x: len(en_trn[x]), bs=bs)
val_samp = SortSampler(en_val, key=lambda x: len(en_val[x]))

In [None]:
## minute 45
## why do we need to transpose the orientation?
## we did the pre-processing already, no augmentation (?)
## padding index
## for the classifier, padding went at the start; here pre_pad=False for the encoder
trn_dl = DataLoader(trn_ds, bs, transpose=True, transpose_y=True, num_workers=1, 
                    pad_idx=1, pre_pad=False, sampler=trn_samp) ## uses fast.ai behind the scenes
val_dl = DataLoader(val_ds, int(bs*1.6), transpose=True, transpose_y=True, num_workers=1, 
                    pad_idx=1, pre_pad=False, sampler=val_samp)## uses fast.ai behind the scenes

## says: I have a training & validation set (and optionally a test set): bundle them into one object with a path to store temporary stuff
md = ModelData(PATH, trn_dl, val_dl)
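What `pad_idx=1`, `pre_pad=False` and `transpose=True` ask for can be pictured with plain lists. This is a sketch of the idea only, not the fast.ai `DataLoader` implementation; the helper name `pad_and_transpose` is hypothetical.

```python
def pad_and_transpose(seqs, pad_idx=1, pre_pad=False):
    """Pad a batch of id sequences to equal length, then transpose it.

    Input:  one list of ids per batch item (batch-major).
    Output: one list per time step (sequence-major), as the RNN expects.
    """
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_idx] * (max_len - len(s))
        # pre_pad=True puts the padding in front, pre_pad=False at the end
        padded.append(fill + list(s) if pre_pad else list(s) + fill)
    # rows become time steps, columns become batch items
    return [list(step) for step in zip(*padded)]

batch = [[5, 6, 2], [7, 2]]          # two sequences, each ending in _eos_ (id 2)
post = pad_and_transpose(batch)      # padding after the sequence
pre = pad_and_transpose(batch, pre_pad=True)
```

With padding at the end, a decoder that emits only padding for every item in the batch can be taken as finished, which is the stopping rule the models in this notebook use.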
## after that you can create a learner and call fit

In [None]:
it = iter(trn_dl)
its = [next(it) for i in range(5)]
[(len(x),len(y)) for x,y in its]

## 2. Architecture

### Initial Model

- Takes sequence of Tokens
- RNN: inject into an encoder (backbone) to turn this into a representation
- This RNN outputs the final hidden state, a vector per sentence
- feed this output into a decoder RNN, which steps through the output one word at a time
- Continue until 'it thinks' the sentence is finished, and return the result
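The decoding loop described in the last two bullets can be sketched without any neural network. Everything here is illustrative: `greedy_decode` and `toy_step` are hypothetical names, the "decoder" is a stub that replays a fixed sequence, and the sketch handles a single sequence rather than a mini-batch.

```python
BOS, PAD, EOS = 0, 1, 2   # same special ids as in toks2ids

def greedy_decode(step, hidden, max_len):
    """Greedy decoding: feed the previous token back in until the sequence ends.

    `step(prev_token, hidden)` must return (scores, new_hidden), where scores
    holds one number per vocabulary entry.
    """
    prev, out = BOS, []
    for _ in range(max_len):
        scores, hidden = step(prev, hidden)
        prev = max(range(len(scores)), key=scores.__getitem__)  # index of the highest score
        if prev in (PAD, EOS):
            break   # the decoder "thinks" the sentence is finished
        out.append(prev)
    return out

def toy_step(prev_token, hidden):
    """Stub decoder: `hidden` is just a position counter into a fixed output."""
    target = [4, 5, 6, EOS]
    scores = [0.0] * 7
    scores[target[hidden]] = 1.0
    return scores, hidden + 1

decoded = greedy_decode(toy_step, hidden=0, max_len=10)
```

The `max_len` cap plays the role of `out_sl` in the model below: decoding stops there even if no end token was produced.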

In [None]:
## number of rows = vocabulary size, each word has a vector
## how big? fastText says size 300!
def create_emb(vecs, itos, em_sz):
    ## start with random embeddings; where we find the word in fastText we replace the row with that vector
    emb = nn.Embedding(len(itos), em_sz, padding_idx=1)
    wgts = emb.weight.data ## pytorch weight attribute is a variable. Vars have a data attribute, which is a tensor
    miss = []
    ## with the weight tensor we can now go through our vocabulary
    for i,w in enumerate(itos):
        ## Hacky: the 3 is about aligning standard deviations
        try: wgts[i] = torch.from_numpy(vecs[w]*3) ## replace random weights with pre-trained vectors. 
        except: miss.append(w) ## keep track of whatever is not in fastText
    print(len(miss),miss[5:10])
    return emb

In [None]:
## check enc vs dec here for encoder and decoder. emb embedding
## out is output
## GRU is similar to LSTM

# Remember, we pass in an index. 
class Seq2SeqRNN(nn.Module):
    def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2):
        super().__init__()
        self.nl,self.nh,self.out_sl = nl,nh,out_sl
        ## create encode embedding
        self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc)
        ## add dropout
        self.emb_enc_drop = nn.Dropout(0.15)
        ## create the RNN: em_sz_enc = size of embedding, nh = our choice (56 for now), 
        ## num_layers: how many layers do we want, some dropout inside the RNN
        self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25) ## standard pytorch, you could use LSTM too
        ## some output to fit the decoder, so lets use a linear layer
        self.out_enc = nn.Linear(nh, em_sz_dec, bias=False) ## nh: number of hidden into the decoder embedding size
        
        self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec)
        self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1) ## or take LSTM
        self.out_drop = nn.Dropout(0.35)
        self.out = nn.Linear(em_sz_dec, len(itos_dec))
        self.out.weight.data = self.emb_dec.weight.data
    ## forward pass
    
    def forward(self, inp):
        #####################
        ## the simplest RNN: takes our input and spits out a hidden vector that hopefully 
        ## will learn to contain all of the information
        ## about what the sentence says and how it says it
        ## Only then can we hope it gives us a translation
        sl,bs = inp.size()
        ## initialise our hidden state to a vector of zeros
        h = self.initHidden(bs)
        ## input through embedding, dropout
        emb = self.emb_enc_drop(self.emb_enc(inp))
        ## pass the zero hidden state into our RNN
        ## gives back the final hidden state
        enc_out, h = self.gru_enc(emb, h)
        ## pass through linear layer
        h = self.out_enc(h)
        ####################
        
        ##### ADDITIONAL to simple RNN
        
        ## dec_inp: represents the previous word that we translated
        ## to produce the next word, the decoder needs to know the word before it, 
        ## therefore we feed the previous word in at each time step
        ## see function toks2ids, above
            ## itos.insert(0, '_bos_') : this is the 1st token, beginning of stream
            ## itos.insert(1, '_pad_')
            ## itos.insert(2, '_eos_')
            ## itos.insert(3, '_unk')
        
        dec_inp = V(torch.zeros(bs).long()) ## _bos_ for 1st run
        res = []
        ## the for loop does the same as the 4 steps inside pytorch, see above
        for i in range(self.out_sl): ## output sequence length (see constructor) 
            ##= length of the longest english sentence, because we translate into english (for this corpus)
            
            ## Normally the RNN gru_dec works on a whole sequence at a time, but we have a for loop,
            ## so unsqueeze adds a leading unit axis at the start. 
            ## So we do not really use the RNN as such and could rewrite this with a linear layer
            ## take the input dec_inp and feed it into the embedding emb_dec; unsqueeze says: 
            ## treat this as a sequence of length 1
            
            ## 1. put through embedding
            ## 1st run: the vector for the beginning-of-stream token
            emb = self.emb_dec(dec_inp).unsqueeze(0) 
            
            ## 2. put through RNN
            ## 1st run: h is whatever came out of the encoder. This figures out what the 1st word is
            outp, h = self.gru_dec(emb, h) 
            
            ## 3. put through dropout 4. put through linear layer
            ## 1st run 3: dropout
            ## 1st run 4: linear layer to convert the output into the correct size for our decoder embedding matrix
            outp = self.out(self.out_drop(outp[0]))
            
            ## 5. Append to the list of translated words
            ## outp: a tensor whose length equals the no. of words in the english vocabulary and
            ## contains, for each word, the score that this word is the next word
            ## pdb.set_trace()
            res.append(outp) 
            
            ## 6. take the highest score: find the index with the highest value in the tensor
            
            dec_inp = V(outp.data.max(1)[1]) ## [1] is the word index of the largest value
            ## dec_inp 1 is padding, so we are finished. Otherwise we stop at the longest sentence length
            if (dec_inp==1).all(): break
        ## we stack up these score vectors into a tensor, so we can feed it to a loss function
        return torch.stack(res)
    
    def initHidden(self, bs): return V(torch.zeros(self.nl, bs, self.nh))

## 3. Loss Function

The loss function is categorical cross-entropy loss:
- input: a list of scores, one for each of our classes (the classes are all words in our english vocab)
- target: the correct class, i.e. the correct word at this location
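As a worked example of the loss just described, the sketch below computes categorical cross entropy for one position and averages it over a short sequence, using only the standard library. The function names are hypothetical; the formula is the standard log-softmax followed by the negative log-likelihood of the target class.

```python
import math

def cross_entropy(scores, target):
    """Cross entropy for one position.

    scores: raw (unnormalised) scores, one per vocabulary entry
    target: index of the correct word
    """
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]   # = -log softmax(scores)[target]

def seq_cross_entropy(seq_scores, targets):
    """Mean cross entropy over all positions of one sequence."""
    losses = [cross_entropy(s, t) for s, t in zip(seq_scores, targets)]
    return sum(losses) / len(losses)

uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)   # equals log(4)
confident = cross_entropy([0.0, 0.0, 10.0, 0.0], target=2)
wrong = cross_entropy([0.0, 0.0, 10.0, 0.0], target=0)
```

A uniform guess over 4 words costs log(4); a confident correct prediction costs almost nothing, and a confident wrong one costs a lot.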

In [None]:

## tweak no.2
def seq2seq_loss(input, target):
    sl,bs = target.size()
    sl_in,bs_in,nc = input.size()
    ## tweak no.1: we might have stopped early, so the output sequence length could be smaller than the target's,
    ##    so we add padding
    ##  pytorch padding: 
    ##      rank 3 tensor (sequence length x batch size x no. words of vocab)
    ##      6-tuple required: one (padding before, padding after) pair per dimension, starting from the last dimension 1:08:30
    ## 1st & 2nd pair (vocab, batch): no padding; 3rd pair (sequence length): none before, as much as required after
    if sl>sl_in: input = F.pad(input, (0,0,0,0,0,sl-sl_in))
    input = input[:sl]
    ## tweak no.2: cross entropy loss expects a rank 2 tensor, we have rank 3, so flatten out: -1 in view
    return F.cross_entropy(input.view(-1,nc), target.view(-1))#, ignore_index=1)
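The length alignment done by `F.pad` and the `[:sl]` slice can be pictured with nested lists. This is a sketch of the idea, not of PyTorch internals; `pad_time` is a hypothetical helper and the prediction is a plain list of shape (sequence length x batch size x vocabulary size).

```python
def pad_time(pred, target_len):
    """Make a (seq_len x batch x vocab) nested list exactly target_len steps long.

    Too short: append time steps filled with zeros (only the first dimension grows).
    Too long:  cut off the extra time steps.
    """
    bs, nc = len(pred[0]), len(pred[0][0])
    missing = max(0, target_len - len(pred))
    extra = [[[0.0] * nc for _ in range(bs)] for _ in range(missing)]
    return (pred + extra)[:target_len]

# The decoder stopped after 2 steps (batch size 1, vocabulary size 3) ...
pred = [[[0.1, 0.7, 0.2]], [[0.9, 0.05, 0.05]]]
# ... but the target has 4 steps, so two all-zero steps are appended.
padded = pad_time(pred, 4)
truncated = pad_time(pred, 1)
```

After this step prediction and target have the same number of time steps, so both can be flattened and compared position by position.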

In [None]:
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

In [None]:
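## this prints some words that are missing from the word vectors (not relevant for us, just: these would be variables)
## nh is not defined anywhere above; 256 is an assumption here, taken from the fastai notebook this one follows
nh = 256
## standard pytorch
rnn = Seq2SeqRNN(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90)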
## put to the GPU
## SingleModel turns pytorch model into fast.ai Model min 1:09:40:
## how to handle learning rate groups (fast.ai concept)
## we call RNN_Learner (not just Learner): RNN_Learner has cross entropy as default criteria
## check save & load_encoder --> not really required
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
## now we give our learner that loss function
learn.crit = seq2seq_loss

Fit the Model

In [None]:
learn.lr_find()
learn.sched.plot()

In [None]:
lr=3e-3

In [None]:
learn.fit(lr, 1, cycle_len=12, use_clr=(20,10))

In [None]:
learn.save('initial')

In [None]:
learn.load('initial')

### Test Initial Model

In [None]:
## CONTINUE 1:11:10

In [None]:
x,y = next(iter(val_dl))
probs = learn.model(V(x))
preds = to_np(probs.max(2)[1])

for i in range(180,190):
    print(' '.join([fr_itos[o] for o in x[:,i] if o != 1]))
    print(' '.join([en_itos[o] for o in y[:,i] if o != 1]))
    print(' '.join([en_itos[o] for o in preds[:,i] if o!=1]))
    print()

### Bidirectional

In [None]:
class Seq2SeqRNN_Bidir(nn.Module):
    def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2):
        super().__init__()
        self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc)
        self.nl,self.nh,self.out_sl = nl,nh,out_sl
        self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25, bidirectional=True)
        self.out_enc = nn.Linear(nh*2, em_sz_dec, bias=False)
        self.drop_enc = nn.Dropout(0.05)
        self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec)
        self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1)
        self.emb_enc_drop = nn.Dropout(0.15)
        self.out_drop = nn.Dropout(0.35)
        self.out = nn.Linear(em_sz_dec, len(itos_dec))
        self.out.weight.data = self.emb_dec.weight.data
        
    def forward(self, inp):
        sl,bs = inp.size()
        h = self.initHidden(bs)
        emb = self.emb_enc_drop(self.emb_enc(inp))
        enc_out, h = self.gru_enc(emb, h)
        h = h.view(2,2,bs,-1).permute(0,2,1,3).contiguous().view(2,bs,-1)
        h = self.out_enc(self.drop_enc(h))

        dec_inp = V(torch.zeros(bs).long())
        res = []
        for i in range(self.out_sl):
            emb = self.emb_dec(dec_inp).unsqueeze(0)
            outp, h = self.gru_dec(emb, h)
            outp = self.out(self.out_drop(outp[0]))
            res.append(outp)
            dec_inp = V(outp.data.max(1)[1])
            if (dec_inp==1).all(): break
        return torch.stack(res)
    
    def initHidden(self, bs): return V(torch.zeros(self.nl*2, bs, self.nh))

In [None]:
rnn = Seq2SeqRNN_Bidir(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90)
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss

In [None]:
learn.fit(lr, 1, cycle_len=12, use_clr=(20,10))

In [None]:
learn.save('bidir')

### Teacher Forcing

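What it is: during training, the decoder is sometimes fed the correct previous target token instead of its own prediction. Why it matters: early in training the predictions are mostly wrong, so without help the decoder rarely sees a useful previous word. In the code in this section, `pr_force` is the probability of feeding the target token; the stepper lowers it from 1 to 0 over the first 10 epochs.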

In [None]:
class Seq2SeqStepper(Stepper):
    def step(self, xs, y, epoch):
        self.m.pr_force = (10-epoch)*0.1 if epoch<10 else 0
        xtra = []
        output = self.m(*xs, y)
        if isinstance(output,tuple): output,*xtra = output
        self.opt.zero_grad()
        loss = raw_loss = self.crit(output, y)
        if self.reg_fn: loss = self.reg_fn(output, xtra, raw_loss)
        loss.backward()
        if self.clip:   # Gradient clipping
            nn.utils.clip_grad_norm(trainable_params_(self.m), self.clip)
        self.opt.step()
        return raw_loss.data[0]

In [None]:
class Seq2SeqRNN_TeacherForcing(nn.Module):
    def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2):
        super().__init__()
        self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc)
        self.nl,self.nh,self.out_sl = nl,nh,out_sl
        self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25)
        self.out_enc = nn.Linear(nh, em_sz_dec, bias=False)
        self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec)
        self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1)
        self.emb_enc_drop = nn.Dropout(0.15)
        self.out_drop = nn.Dropout(0.35)
        self.out = nn.Linear(em_sz_dec, len(itos_dec))
        self.out.weight.data = self.emb_dec.weight.data
        self.pr_force = 1.
        
    def forward(self, inp, y=None):
        sl,bs = inp.size()
        h = self.initHidden(bs)
        emb = self.emb_enc_drop(self.emb_enc(inp))
        enc_out, h = self.gru_enc(emb, h)
        h = self.out_enc(h)

        dec_inp = V(torch.zeros(bs).long())
        res = []
        for i in range(self.out_sl):
            emb = self.emb_dec(dec_inp).unsqueeze(0)
            outp, h = self.gru_dec(emb, h)
            outp = self.out(self.out_drop(outp[0]))
            res.append(outp)
            dec_inp = V(outp.data.max(1)[1])
            if (dec_inp==1).all(): break
            if (y is not None) and (random.random()<self.pr_force):
                if i>=len(y): break
                dec_inp = y[i]
        return torch.stack(res)
    
    def initHidden(self, bs): return V(torch.zeros(self.nl, bs, self.nh))
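The interplay of the stepper's schedule and the model's coin flip can be isolated in a few lines. The schedule formula is copied from `Seq2SeqStepper` above; the function names `pr_force_for` and `next_decoder_input` are hypothetical and only serve this sketch.

```python
import random

def pr_force_for(epoch):
    """Probability of teacher forcing: 1.0 in epoch 0, falling linearly to 0 from epoch 10 on."""
    return (10 - epoch) * 0.1 if epoch < 10 else 0

def next_decoder_input(predicted, target, pr_force, rng=random):
    """Choose what the decoder sees next: the true token or its own prediction."""
    # rng.random() is in [0, 1), so pr_force=1 always forces and pr_force=0 never does
    return target if rng.random() < pr_force else predicted

schedule = [pr_force_for(e) for e in range(12)]
forced = next_decoder_input(predicted=7, target=3, pr_force=pr_force_for(0))
free = next_decoder_input(predicted=7, target=3, pr_force=pr_force_for(10))
```

In the first epoch the decoder always receives the target token; from epoch 10 onwards it has to rely on its own predictions, which matches how it is used at test time.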

In [None]:
rnn = Seq2SeqRNN_TeacherForcing(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90)
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss

In [None]:
learn.fit(lr, 1, cycle_len=12, use_clr=(20,10), stepper=Seq2SeqStepper)

In [None]:
learn.save('forcing')

### Attentional Model

In [None]:
def rand_t(*sz): return torch.randn(sz)/math.sqrt(sz[0])
def rand_p(*sz): return nn.Parameter(rand_t(*sz))

In [None]:
class Seq2SeqAttnRNN(nn.Module):
    def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2):
        super().__init__()
        self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc)
        self.nl,self.nh,self.out_sl = nl,nh,out_sl
        self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25)
        self.out_enc = nn.Linear(nh, em_sz_dec, bias=False)
        self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec)
        self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1)
        self.emb_enc_drop = nn.Dropout(0.15)
        self.out_drop = nn.Dropout(0.35)
        self.out = nn.Linear(em_sz_dec, len(itos_dec))
        self.out.weight.data = self.emb_dec.weight.data

        self.W1 = rand_p(nh, em_sz_dec)
        self.l2 = nn.Linear(em_sz_dec, em_sz_dec)
        self.l3 = nn.Linear(em_sz_dec+nh, em_sz_dec)
        self.V = rand_p(em_sz_dec)

    def forward(self, inp, y=None, ret_attn=False):
        sl,bs = inp.size()
        h = self.initHidden(bs)
        emb = self.emb_enc_drop(self.emb_enc(inp))
        enc_out, h = self.gru_enc(emb, h)
        h = self.out_enc(h)

        dec_inp = V(torch.zeros(bs).long())
        res,attns = [],[]
        w1e = enc_out @ self.W1
        for i in range(self.out_sl):
            w2h = self.l2(h[-1])
            u = F.tanh(w1e + w2h)
            a = F.softmax(u @ self.V, 0)
            attns.append(a)
            Xa = (a.unsqueeze(2) * enc_out).sum(0)
            emb = self.emb_dec(dec_inp)
            wgt_enc = self.l3(torch.cat([emb, Xa], 1))
            
            outp, h = self.gru_dec(wgt_enc.unsqueeze(0), h)
            outp = self.out(self.out_drop(outp[0]))
            res.append(outp)
            dec_inp = V(outp.data.max(1)[1])
            if (dec_inp==1).all(): break
            if (y is not None) and (random.random()<self.pr_force):
                if i>=len(y): break
                dec_inp = y[i]

        res = torch.stack(res)
        if ret_attn: res = res,torch.stack(attns)
        return res

    def initHidden(self, bs): return V(torch.zeros(self.nl, bs, self.nh))
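The core of the attention step in `forward` above (softmax over one score per encoder position, then a weighted sum of the encoder outputs) can be sketched with plain lists. This leaves out the learned parts (`W1`, `l2`, `V`) that produce the scores; `softmax` and `attend` are hypothetical helper names and the sketch handles one sequence, not a mini-batch.

```python
import math

def softmax(xs):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, enc_out):
    """Weighted average of encoder outputs.

    scores:  one number per input position (how relevant is it right now?)
    enc_out: one vector per input position
    """
    weights = softmax(scores)
    dim = len(enc_out[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, enc_out)) for d in range(dim)]
    return weights, context

enc_out = [[1.0, 0.0], [0.0, 1.0]]                 # two input positions, vectors of size 2
even_w, even_ctx = attend([0.0, 0.0], enc_out)     # equal scores: plain average
sharp_w, sharp_ctx = attend([50.0, 0.0], enc_out)  # first position dominates
```

The weights correspond to `a` and the context vector to `Xa` in the model; plotting the weights per output step is what the attention plots later in this notebook show.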

In [None]:
rnn = Seq2SeqAttnRNN(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90)
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss

In [None]:
lr=2e-3

In [None]:
learn.fit(lr, 1, cycle_len=15, use_clr=(20,10), stepper=Seq2SeqStepper)

In [None]:
learn.save('attn')

In [None]:
learn.load('attn')

### Test Current Model

In [None]:
x,y = next(iter(val_dl))
probs,attns = learn.model(V(x),ret_attn=True)
preds = to_np(probs.max(2)[1])

In [None]:
for i in range(180,190):
    print(' '.join([fr_itos[o] for o in x[:,i] if o != 1]))
    print(' '.join([en_itos[o] for o in y[:,i] if o != 1]))
    print(' '.join([en_itos[o] for o in preds[:,i] if o!=1]))
    print()

In [None]:
attn = to_np(attns[...,180])

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 10))
for i,ax in enumerate(axes.flat):
    ax.plot(attn[i])

### Summarised

In [None]:
class Seq2SeqRNN_All(nn.Module):
    def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec, em_sz_dec, nh, out_sl, nl=2):
        super().__init__()
        self.emb_enc = create_emb(vecs_enc, itos_enc, em_sz_enc)
        self.nl,self.nh,self.out_sl = nl,nh,out_sl
        self.gru_enc = nn.GRU(em_sz_enc, nh, num_layers=nl, dropout=0.25, bidirectional=True)
        self.out_enc = nn.Linear(nh*2, em_sz_dec, bias=False)
        self.drop_enc = nn.Dropout(0.25)
        self.emb_dec = create_emb(vecs_dec, itos_dec, em_sz_dec)
        self.gru_dec = nn.GRU(em_sz_dec, em_sz_dec, num_layers=nl, dropout=0.1)
        self.emb_enc_drop = nn.Dropout(0.15)
        self.out_drop = nn.Dropout(0.35)
        self.out = nn.Linear(em_sz_dec, len(itos_dec))
        self.out.weight.data = self.emb_dec.weight.data

        self.W1 = rand_p(nh*2, em_sz_dec)
        self.l2 = nn.Linear(em_sz_dec, em_sz_dec)
        self.l3 = nn.Linear(em_sz_dec+nh*2, em_sz_dec)
        self.V = rand_p(em_sz_dec)
        
    def forward(self, inp, y=None):
        sl,bs = inp.size()
        h = self.initHidden(bs)
        emb = self.emb_enc_drop(self.emb_enc(inp))
        enc_out, h = self.gru_enc(emb, h)
        h = h.view(2,2,bs,-1).permute(0,2,1,3).contiguous().view(2,bs,-1)
        h = self.out_enc(self.drop_enc(h))

        dec_inp = V(torch.zeros(bs).long())
        res,attns = [],[]
        w1e = enc_out @ self.W1
        for i in range(self.out_sl):
            w2h = self.l2(h[-1])
            u = F.tanh(w1e + w2h)
            a = F.softmax(u @ self.V, 0)
            attns.append(a)
            Xa = (a.unsqueeze(2) * enc_out).sum(0)
            emb = self.emb_dec(dec_inp)
            wgt_enc = self.l3(torch.cat([emb, Xa], 1))
            
            outp, h = self.gru_dec(wgt_enc.unsqueeze(0), h)
            outp = self.out(self.out_drop(outp[0]))
            res.append(outp)
            dec_inp = V(outp.data.max(1)[1])
            if (dec_inp==1).all(): break
            if (y is not None) and (random.random()<self.pr_force):
                if i>=len(y): break
                dec_inp = y[i]
        return torch.stack(res)

    def initHidden(self, bs): return V(torch.zeros(self.nl*2, bs, self.nh))

In [None]:
rnn = Seq2SeqRNN_All(fr_vecd, fr_itos, dim_fr_vec, en_vecd, en_itos, dim_en_vec, nh, enlen_90)
learn = RNN_Learner(md, SingleModel(to_gpu(rnn)), opt_fn=opt_fn)
learn.crit = seq2seq_loss

In [None]:
learn.fit(lr, 1, cycle_len=15, use_clr=(20,10), stepper=Seq2SeqStepper)

## Final Test

In [None]:
x,y = next(iter(val_dl))
probs = learn.model(V(x))
preds = to_np(probs.max(2)[1])

for i in range(180,190):
    print(' '.join([fr_itos[o] for o in x[:,i] if o != 1]))
    print(' '.join([en_itos[o] for o in y[:,i] if o != 1]))
    print(' '.join([en_itos[o] for o in preds[:,i] if o!=1]))
    print()