# HW 3: Neural Machine Translation

In this homework you will build a full neural machine translation system using an attention-based encoder-decoder network to translate from German to English. The encoder-decoder network with attention forms the backbone of many current text generation systems. See [Neural Machine Translation and Sequence-to-sequence Models: A Tutorial](https://arxiv.org/pdf/1703.01619.pdf) for an excellent tutorial that also contains many modern advances.

## Goals


1. Build a non-attentional baseline model (pure seq2seq as in [ref](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)). 
2. Incorporate attention into the baseline model ([ref](https://arxiv.org/abs/1409.0473) but with dot-product attention as in class notes).
3. Implement beam search: review/tutorial [here](http://www.phontron.com/slides/nlp-programming-en-13-search.pdf)
4. Visualize the attention distribution for a few examples. 

Consult the papers provided for hyperparameters, and the course notes for formal definitions.

This will be the most time-consuming assignment in terms of difficulty/training time, so we recommend that you get started early!

## Setup

This notebook provides a working definition of the setup of the problem itself. Feel free to construct your models inline, or use an external setup (preferred) to build your system.

In [1]:
# Text processing library and methods for pretrained word embeddings
from torchtext import data
from torchtext import datasets
import torch as t

We first need to process the raw data using a tokenizer. We are going to be using spacy, which can be installed via:  
  `[sudo] pip install spacy`  
  
Tokenizers for English/German can be installed via:  
  `[sudo] python -m spacy download en`  
  `[sudo] python -m spacy download de`
  
This isn't *strictly* necessary, and you can use your own tokenization rules if you prefer (e.g. a simple `split()` in addition to some rules to account for punctuation), but we recommend sticking to the above.

In [2]:
import spacy
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]


Note that we need to add the beginning-of-sentence token `<s>` and the end-of-sentence token `</s>` to the 
target so we know when to begin/end translating. We do not need to do this on the source side.

In [3]:
BOS_WORD = '<s>'
EOS_WORD = '</s>'
DE = data.Field(tokenize=tokenize_de)
EN = data.Field(tokenize=tokenize_en, init_token = BOS_WORD, eos_token = EOS_WORD) # only target needs BOS/EOS
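
To make the role of the two markers concrete, here is a small sketch of how a target sentence is typically split into decoder inputs and prediction labels during training. The helper name `shift_target` is hypothetical and not part of `torchtext`; the `Field` above only adds the tokens, and the shifting is left to your model code.

```python
BOS_WORD = '<s>'
EOS_WORD = '</s>'

def shift_target(tokens):
    """Return (decoder_inputs, labels) for one target sentence.

    Hypothetical helper for illustration: the decoder reads every token
    except the last one and is trained to predict the token that follows.
    """
    wrapped = [BOS_WORD] + list(tokens) + [EOS_WORD]
    return wrapped[:-1], wrapped[1:]

decoder_inputs, labels = shift_target(['This', 'is', 'Bill', 'Lange', '.'])
print(decoder_inputs)  # ['<s>', 'This', 'is', 'Bill', 'Lange', '.']
print(labels)          # ['This', 'is', 'Bill', 'Lange', '.', '</s>']
```

The source side needs no such markers because the encoder reads the whole sentence at once and never has to decide when to start or stop.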

Let's download the data. This may take a few minutes.

**While this dataset of 200K sentence pairs is relatively small compared to others, it will still take some time to train, so we will work only with sentences of length at most 20. Please train only on this reduced dataset for this homework.**

In [4]:
MAX_LEN = 20
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(DE, EN), 
                                         filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
                                         len(vars(x)['trg']) <= MAX_LEN)
print(train.fields)
print(len(train))
print(vars(train[0]))

{'src': <torchtext.data.field.Field object at 0x7f4c7d350320>, 'trg': <torchtext.data.field.Field object at 0x7f4c7d350358>}
119076
{'src': ['David', 'Gallo', ':', 'Das', 'ist', 'Bill', 'Lange', '.', 'Ich', 'bin', 'Dave', 'Gallo', '.'], 'trg': ['David', 'Gallo', ':', 'This', 'is', 'Bill', 'Lange', '.', 'I', "'m", 'Dave', 'Gallo', '.']}


Now we build the vocabulary and convert the text corpus into indices. Tokens that occur fewer than 5 times are replaced with `<unk>`, and the remaining tokens form our vocab.

In [5]:
MIN_FREQ = 5
DE.build_vocab(train.src, min_freq=MIN_FREQ)
EN.build_vocab(train.trg, min_freq=MIN_FREQ)
print(DE.vocab.freqs.most_common(10))
print("Size of German vocab", len(DE.vocab))
print(EN.vocab.freqs.most_common(10))
print("Size of English vocab", len(EN.vocab))
print(EN.vocab.stoi["<s>"], EN.vocab.stoi["</s>"]) #vocab index for <s>, </s>

[('.', 113253), (',', 67237), ('ist', 24189), ('die', 23778), ('das', 17102), ('der', 15727), ('und', 15622), ('Sie', 15085), ('es', 13197), ('ich', 12946)]
Size of German vocab 13353
[('.', 113433), (',', 59512), ('the', 46029), ('to', 29177), ('a', 27548), ('of', 26794), ('I', 24887), ('is', 21775), ("'s", 20630), ('that', 19814)]
Size of English vocab 11560
2 3
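
The effect of `min_freq` can be illustrated without `torchtext`. The sketch below uses a toy corpus and two hypothetical helpers, `build_vocab` and `numericalize`; it mirrors the idea (rare tokens map to `<unk>`), not the library's exact implementation or index order.

```python
from collections import Counter

def build_vocab(sentences, min_freq, specials=('<unk>', '<pad>')):
    """Map tokens seen at least `min_freq` times to indices (toy version)."""
    freqs = Counter(tok for sent in sentences for tok in sent)
    kept = sorted(tok for tok, count in freqs.items() if count >= min_freq)
    return {tok: i for i, tok in enumerate(list(specials) + kept)}

def numericalize(sentence, stoi):
    """Replace each token by its index, falling back to `<unk>`."""
    return [stoi.get(tok, stoi['<unk>']) for tok in sentence]

corpus = [['das', 'ist', 'gut'], ['das', 'ist', 'neu'], ['das', 'Buch']]
stoi = build_vocab(corpus, min_freq=2)
print(stoi)                                        # {'<unk>': 0, '<pad>': 1, 'das': 2, 'ist': 3}
print(numericalize(['das', 'Buch', 'ist'], stoi))  # [2, 0, 3]
```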


Now we split our data into batches as usual. Batching for MT is slightly tricky because source/target will be of different lengths. Fortunately, `torchtext` lets you do this by allowing you to pass in a `sort_key` function. This minimizes the amount of padding on the source side, but since there is still some padding you will inadvertently "attend" to these padding tokens. 

One way to get rid of padding is to pass a binary `mask` vector to your attention module so that the attention score (before the softmax) is minus infinity for each padding token. Another way (which is how we do it for our projects, e.g. OpenNMT) is to manually sort data into batches so that each batch has exactly the same source length (this means that some batches will be smaller than the desired batch size, though).

However, for this homework padding won't matter too much, so it's fine to ignore it.
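
If you do want to try masking, the idea can be sketched in plain Python for a single decoder step. The function name `masked_attention` and the example scores are made up for illustration; in a PyTorch model the same masking would be applied to the score tensor before the softmax, for example with `masked_fill`.

```python
import math

def masked_attention(scores, is_pad):
    """Softmax over source positions with padding masked out.

    scores: one raw attention score per source position.
    is_pad: True where the source position is a padding token.
    Assumes at least one position is not padding.
    """
    masked = [-math.inf if pad else s for s, pad in zip(scores, is_pad)]
    top = max(masked)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

weights = masked_attention([2.0, 1.0, 0.5, 0.5], [False, False, True, True])
print(weights)  # the two padded positions receive exactly zero weight
```

Because the padded positions get zero weight, they contribute nothing to the context vector, while the remaining weights still sum to one.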

In [6]:
BATCH_SIZE = 32
train_iter, val_iter = data.BucketIterator.splits((train, val), batch_size=BATCH_SIZE, device=-1,
                                                  repeat=False, sort_key=lambda x: len(x.src))

Let's check that the BOS/EOS tokens have indeed been added to the target (English) sentence.

In [None]:
batch = next(iter(train_iter))
print("Source")
print(batch.src)
print("Target")
print(batch.trg)


Success! Now that we've processed the data, we are ready to begin modeling.

## Assignment

Now it is your turn to build the models described at the top of the assignment. 

When a model is trained, use the following test function to produce predictions, and then upload them to the Kaggle competition: https://www.kaggle.com/c/cs287-hw3-s18/

For the final Kaggle test, we will provide the source sentences, and you are to predict the **first three words of the target sentence**. The source sentences can be found in `source_test.txt`.

In [None]:
!head source_test.txt

Similar to HW1, you are to predict the 100 most probable 3-grams that will begin the target sentence. In the submission format, the words within a 3-gram are separated by "|", and the 3-grams are separated by a space. For example, here is what a submission might look like with the 5 most likely 3-grams (instead of 100).

```
id,word
1,Newspapers|talk|about When|I|was Researchers|call|the Twentysomethings|like|Alex But|before|long
2,That|'s|what Newspapers|talk|about You|have|robbed It|'s|realizing My|parents|wanted
3,We|forget|how We|think|about Proust|actually|links Does|any|other This|is|something
4,But|what|do And|it|'s They|'re|on My|name|is It|only|happens
```

When you print out your data, you will need to escape quotes and commas with the following function so that Kaggle does not complain. 

In [None]:
def escape(l):
    return l.replace("\"", "<quote>").replace(",", "<comma>")
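
As a rough sketch of how `escape` fits into the submission format described above, the snippet below builds one row from a list of 3-grams. `format_row` is a hypothetical helper, and `escape` is repeated only so that the snippet runs on its own.

```python
def escape(l):
    return l.replace("\"", "<quote>").replace(",", "<comma>")

def format_row(sentence_id, trigrams):
    """Build one submission line: `id,w1|w2|w3 w1|w2|w3 ...`."""
    joined = " ".join("|".join(words) for words in trigrams)
    return "{},{}".format(sentence_id, escape(joined))

row = format_row(2, [("That", "'s", "what"), ("It", ",", "\"")])
print("id,word")
print(row)  # 2,That|'s|what It|<comma>|<quote>
```

Escaping happens after joining, so the only literal comma left in the row is the one that separates the id from the predictions.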

You should perform your hyperparameter search/early stopping/write-up based on perplexity, not the above metric. (In practice, people use a metric called [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf), which is roughly a geometric average of 1-gram, 2-gram, 3-gram, 4-gram precision, with a brevity penalty for producing translations that are too short.)
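
Perplexity is the exponential of the average negative log-likelihood per target token. The sketch below assumes you have already collected the natural-log probability your model assigns to each gold target token, with padding positions excluded; the function name and the example values are illustrative only.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per target token."""
    nll = -sum(token_log_probs)
    return math.exp(nll / len(token_log_probs))

# A model that spreads its probability uniformly over 4 choices at every step
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))  # approximately 4.0
```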

Finally, as always please put up a (short) write-up following the template provided in the repository:  https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/


In [7]:
import os
os.chdir('../HW3')

In [8]:
import argparse
import os

import torch as t
import torch.nn as nn
import torch.nn.functional as F
import torchtext
from torchtext.vocab import Vectors, GloVe

os.chdir('../HW3')  # so that the local imports below work even if HW3 is not already the working directory
from process_params import check_args, get_params
from data_process import generate_iterators
from utils import *
from utils import variable
from const import *
t.manual_seed(1)

# imported after the star imports above so that these names are not shadowed by them
import pickle
from copy import deepcopy

import numpy as np
import spacy
from torch.autograd import Variable
from torchtext import data
from torchtext import datasets

In [None]:
'''class LSTMA(t.nn.Module):
    """
    Implementation of `Neural Machine Translation by Jointly Learning to Align and Translate`
    https://arxiv.org/abs/1409.0473

    NOTE THAT ITS INPUT SHOULD HAVE THE BATCH SIZE FIRST !!!!!
    """

    def __init__(self, params, source_embeddings=None, target_embeddings=None):
        super(LSTMA, self).__init__()
        print("Initializing LSTMA")
        self.cuda_flag = params.get('cuda', CUDA_DEFAULT)
        self.model_str = 'LSTMA'
        self.params = params

        # Initialize hyperparams.
        self.hidden_dim = params.get('hidden_dim', 100)
        self.batch_size = params.get('batch_size', 32)
        try:
            # if you provide pre-trained embeddings for target/source, they should have the same embedding dim
            assert source_embeddings.size(1) == target_embeddings.size(1)
            self.embedding_dim = source_embeddings.size(1)
            self.source_vocab_size = params.get('source_vocab_size')
            self.target_vocab_size = params.get('target_vocab_size')
        except:
            # if you dont provide a pre-trained embedding, you have to provide these
            self.embedding_dim = params.get('embedding_dim')
            self.source_vocab_size = params.get('source_vocab_size')
            self.target_vocab_size = params.get('target_vocab_size')
            assert self.embedding_dim is not None and self.source_vocab_size is not None and self.target_vocab_size is not None
        self.output_size = self.target_vocab_size
        self.num_layers = params.get('num_layers', 1)
        self.dropout = params.get('dropout', 0.5)
        self.embed_dropout = params.get('embed_dropout')
        self.train_embedding = params.get('train_embedding', True)

        # Initialize embeddings. Static embeddings for now.
        self.source_embeddings = t.nn.Embedding(self.source_vocab_size, self.embedding_dim)
        self.target_embeddings = t.nn.Embedding(self.target_vocab_size, self.embedding_dim)
        if source_embeddings is not None:
            self.source_embeddings.weight = t.nn.Parameter(source_embeddings, requires_grad=self.train_embedding)
        if target_embeddings is not None:
            self.target_embeddings.weight = t.nn.Parameter(target_embeddings, requires_grad=self.train_embedding)

        # Initialize network modules.
        # note that the encoder is a BiLSTM. The output is modified by the fact that the hidden dim is doubled, and if you set
        # the number of layers to L, there will actually be 2L layers (the forward ones and the backward ones). Consequently the first
        # dimension of the hidden outputs of the forward pass (the 2nd output in the tuple) will be a tuple of
        # 2 tensors having as first dim twice the hidden dim you set
        self.encoder_rnn = t.nn.LSTM(self.embedding_dim, self.hidden_dim // 2, dropout=self.dropout, num_layers=self.num_layers, bidirectional=True, batch_first=True)
        self.decoder_rnn = t.nn.LSTM(self.embedding_dim, self.hidden_dim, dropout=self.dropout, num_layers=self.num_layers, batch_first=True)
        self.hidden_dec_initializer = t.nn.Linear(self.hidden_dim // 2, self.hidden_dim)
        self.hidden2out = t.nn.Linear(self.hidden_dim * 2, self.output_size)
        if self.embed_dropout:
            self.dropout_1s = t.nn.Dropout(self.dropout)
            self.dropout_1t = t.nn.Dropout(self.dropout)
        self.dropout_2 = t.nn.Dropout(self.dropout)
        self.lsm = nn.LogSoftmax(dim=-1)  # explicit dim avoids the implicit-dimension warning
        
        self.beam_size = params.get('beam_size',3)
        self.max_beam_depth = params.get('max_beam_depth',20)

        if self.cuda_flag:
            self.cuda()  # moves the parameters in place; rebinding `self` has no effect

    # @todo: maybe this is wrong in case of deep LSTM DECODER (I am not sure the dimensions are correct)
    def init_hidden(self, data, type, batch_size=None):
        """
        Initialize the hidden state, either for the encoder or the decoder

        For type=`enc`, it should just be initialized with 0s
        For type=`dec`, it should be initialized with tanh(W h1_backward) (see page 13 of the paper, last paragraph)

        `data` is either something you initialize the hidden state with, or None
        """
        bs = batch_size if batch_size is not None else self.batch_size
        if type == 'dec':
            # in that case, `data` is the output of the encoder
            # data[:, :1, self.hidden_dim // 2:]
            # `:` for the whole batch
            # `:1` because you want the hidden state of the first time step (see paper, they use backward(h1))
            # but also `self.hidden_dim // 2:`, because you want the backward part only (the last coefficients)
            h = F.tanh(self.hidden_dec_initializer(data[:, :1, self.hidden_dim // 2:]))  # @todo: verify that the last hdim/2 weights actually correspond to the backward layer(s)
            h = h.transpose(1, 0)
            return (
                h,
                variable(np.zeros((self.num_layers, bs, self.hidden_dim)), cuda=self.cuda_flag)
            )
        elif type == 'enc':
            # in that case data is None
            return tuple((
                variable(np.zeros((self.num_layers * 2, bs, self.hidden_dim // 2)), cuda=self.cuda_flag),
                variable(np.zeros((self.num_layers * 2, bs, self.hidden_dim // 2)), cuda=self.cuda_flag)
            ))
        else:
            raise ValueError('the type should be either `dec` or `enc`')

    def forward(self, x_source, x_target):
        # EMBEDDING
        embedded_x_source = self.source_embeddings(x_source)
        embedded_x_target = self.target_embeddings(x_target[:, :-1])  # don't make a prediction for the word following the last one
        if self.embed_dropout:
            embedded_x_source = self.dropout_1s(embedded_x_source)
            embedded_x_target = self.dropout_1t(embedded_x_target)

        # RECURRENT
        hidden = self.init_hidden(None, 'enc', x_source.size(0))
        enc_out, _ = self.encoder_rnn(embedded_x_source, hidden)
        hidden = self.init_hidden(enc_out, 'dec', x_source.size(0))
        dec_out, _ = self.decoder_rnn(embedded_x_target, hidden)

        # ATTENTION
        scores = t.bmm(enc_out, dec_out.transpose(1, 2))  # this will be a batch x source_len x target_len
        attn_dist = F.softmax(scores, dim=1)  # batch x source_len x target_len
        context = t.bmm(attn_dist.permute(0, 2, 1), enc_out)  # batch x target_len x hidden_dim

        # OUTPUT
        # concatenate the output of the decoder and the context and apply nonlinearity
        pred = F.tanh(t.cat([dec_out, context], -1))  # @todo : tanh necessary ?
        pred = self.dropout_2(pred)  # batch x target_len x 2 hdim
        pred = self.hidden2out(pred)
        return pred

    def translate(self, x_source):
        self.eval()

        # EMBEDDING
        embedded_x_source = self.source_embeddings(x_source)
        if self.embed_dropout:
            embedded_x_source = self.dropout_1s(embedded_x_source)

        # RECURRENT
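        # use the actual batch size of the input, which may differ from self.batch_size
        hidden = self.init_hidden(None, 'enc', x_source.size(0))
        enc_out, _ = self.encoder_rnn(embedded_x_source, hidden)
        hidden = self.init_hidden(enc_out, 'dec', x_source.size(0))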
        x_target = (SOS_TOKEN * t.ones(x_source.size(0), 1)).long()  # `2` is the SOS token (<s>)
        x_target = variable(x_target, to_float=False, cuda=self.cuda_flag)
        count_eos = 0
        time = 0
        while count_eos < x_source.size(0):
            # feed only the newest token: `hidden` already summarizes the earlier prefix
            embedded_x_target = self.target_embeddings(x_target[:, -1:])
            dec_out, hidden = self.decoder_rnn(embedded_x_target, hidden)
            hidden = hidden[0].detach(), hidden[1].detach()
            dec_out = dec_out.detach()  # batch x 1 x hidden_dim

            # ATTENTION
            scores = t.bmm(enc_out, dec_out.transpose(1, 2))  # this will be a batch x source_len x target_len
            try:
                attn_dist = F.softmax(scores, dim=1)  # batch x source_len x target_len
            except:
                attn_dist = F.softmax(scores.permute(1, 0, 2)).permute(1, 0, 2)
            context = t.bmm(attn_dist.permute(0, 2, 1), enc_out)  # batch x target_len x hidden_dim

            # OUTPUT
            # concatenate the output of the decoder and the context and apply nonlinearity
            pred = F.tanh(t.cat([dec_out, context], -1))  # @todo : tanh necessary ?
            pred = self.dropout_2(pred)  # batch x target_len x 2 hdim
            pred = self.hidden2out(pred).detach()
            x_target = t.cat([x_target, pred.max(2)[1]], 1).detach()

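            # should you stop ? NOTE: this counts every EOS emitted anywhere in the batch,
            # so decoding can stop before each sentence has produced its own EOS
            count_eos += t.sum((pred.max(2)[1] == EOS_TOKEN).long()).data.cpu().numpy()[0]  # `3` is the EOS token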
            time += 1
        return x_target
    
    def translate_beam(self,x_source,print_beam_row=-1):
        self.eval()        

        # EMBEDDING
        embedded_x_source = self.source_embeddings(x_source)
        if self.embed_dropout:
            embedded_x_source = self.dropout_1s(embedded_x_source)
        
        terminate_beam = False
        batch_size = x_source.size(0)
        
        # RECURRENT
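        # use the actual batch size of the input, which may differ from self.batch_size
        hidden = self.init_hidden(None, 'enc', batch_size)
        enc_out, _ = self.encoder_rnn(embedded_x_source, hidden)
        hidden = self.init_hidden(enc_out, 'dec', batch_size)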
        x_target = SOS_TOKEN * np.ones((x_source.size(0), 1))  # `2` is the SOS token (<s>)
        count_eos = 0
        time = 0        
        
        #INIT SOME STUFF.
        self.beam = np.array([x_target])
        self.beam_scores = np.zeros((batch_size,1))
        
        while not terminate_beam and time < self.max_beam_depth: 
            
            collective_children   = np.array([])
            collective_scores     = np.array([])
           
            if len(self.beam) == 1:
                reshaped_beam = self.beam
            else:
                reshaped_beam = np.transpose(self.beam,(1,0,2))
            
            for it, elem in enumerate(reshaped_beam) : 
                elem = t.from_numpy(elem).long()
                x_target = elem.contiguous().view(batch_size,-1)
                x_target = variable(x_target, to_float=False, cuda=self.cuda_flag).long()
                embedded_x_target = self.target_embeddings(x_target)
                # run the decoder over the whole prefix from the initial decoder state,
                # so hypotheses neither share nor re-consume a recurrent state
                dec_out, _ = self.decoder_rnn(embedded_x_target, hidden)
                dec_out = dec_out[:, time:time + 1, :].detach()
    
                # ATTENTION
                scores = t.bmm(enc_out, dec_out.transpose(1, 2))  # this will be a batch x source_len x target_len
                try:
                    attn_dist = F.softmax(scores, dim=1)  # batch x source_len x target_len
                except:
                    attn_dist = F.softmax(scores.permute(1, 0, 2)).permute(1, 0, 2)
                context = t.bmm(attn_dist.permute(0, 2, 1), enc_out)  # batch x target_len x hidden_dim
    
                # OUTPUT
                # concatenate the output of the decoder and the context and apply nonlinearity
                pred = F.tanh(t.cat([dec_out, context], -1))  # @todo : tanh necessary ?
                pred = self.dropout_2(pred)  # batch x target_len x 2 hdim
                pred = self.hidden2out(pred).detach()
                
                pred = self.lsm(pred.view(batch_size,-1)).detach()

                topk = t.topk(pred, self.beam_size,dim=1)
                #import pdb; pdb.set_trace()
            
                #topk dimensions - batch * 1 * beam
                top_k_indices, top_k_scores = topk[1],topk[0]
                
                #temporarily get them in beam*batch dimensions to iterate over each beam element.
                top_k_indices = top_k_indices.transpose(0,1)
                top_k_scores = top_k_scores.transpose(0,1)              
                
                for new_word_batch, new_score_batch in zip(top_k_indices, top_k_scores):    
                    #import pdb; pdb.set_trace()  
                    new_word_batch= new_word_batch.contiguous().view(batch_size,1)
                    new_score_batch = new_score_batch.contiguous().view(batch_size,1) 
                    new_child_batch = t.cat([x_target,new_word_batch],1).detach()                    
                   
                    batch_parent_score = self.beam_scores[:,it].reshape((batch_size,1))
                    batch_acc_score =  batch_parent_score + new_score_batch.data.cpu().numpy()        
                
                    if len(collective_children) > 0:
                        collective_children = np.hstack((collective_children, new_child_batch.data.cpu().numpy())) 
                        #Add the corresponding beam element's score with the new score and stack it.
                        collective_scores   = np.hstack((collective_scores, batch_acc_score ))              
                    else:
                        collective_children, collective_scores = new_child_batch.data.cpu().numpy(),batch_acc_score               
            #import pdb; pdb.set_trace()
                     
            #At the end of a for loop collective children, collective scores 
            #will look a numpy array of tensors.            
            current_beam_length = 1 #Means only start elem is there.
            if len(self.beam)!= 1:
                current_beam_length = self.beam.shape[1]  
            
            #import pdb; pdb.set_trace()
                  
            collective_children = collective_children.reshape((batch_size, current_beam_length*self.beam_size, 
                                                               int(collective_children.shape[1]/
                                                                   current_beam_length/self.beam_size)
                                                             ))
            
            if collective_children.shape[1] == self.beam_size:  #Happens the first time.
                self.beam = collective_children  
                self.beam_scores = collective_scores
                if print_beam_row > -1:
                    for l in range(self.beam_size):
                        print([EN.vocab.itos[int(x)] for x in self.beam[print_beam_row,int(l)]])
                             
                
            else:
                self.beam = deepcopy(np.zeros((batch_size,self.beam_size,collective_children.shape[2])))
                for i in range(batch_size):
                    #Since argsort gives ascending order
                    #import pdb; pdb.set_trace()
                    best_scores_indices = np.argsort(-1*collective_scores[i])[:self.beam_size]  
                    for key,index in enumerate(best_scores_indices):
                        self.beam[i][key][:] = collective_children[i][index]                       
                        self.beam_scores[i][key] = collective_scores[i][index]
                if print_beam_row > -1:
                    for l in range(self.beam_size):
                        print([EN.vocab.itos[int(x)] for x in self.beam[print_beam_row,int(l)]])
           
            
            terminate_beam = True
            
            for x in self.beam:
                    for c in x:
                        if EOS_TOKEN not in c:
                            terminate_beam = False
                            break
                    if not terminate_beam:
                        break   
            #import pdb; pdb.set_trace()
            assert(self.beam.shape == (batch_size,self.beam_size,time+2))

            time += 1                 
            
        return self.beam 
    '''

In [10]:
import json
from utils import load_model
import os
os.chdir('../HW3')
from data_process import generate_kaggle_text


from translation_models import LSTMA
from const import *
with open('LSTMA/4.params.json', 'r') as f:
    params = json.load(f)
params['beam_size'] = 3
lstma = LSTMA(params).cuda()
load_model(lstma, 'LSTMA/4.pytorch', cuda=True)

Initializing LSTMA


In [11]:
val_iter.batch_size = 64
for batch in val_iter:
    pred_beam = lstma.translate_beam(batch.src.transpose(0,1).cuda()) 
    break



In [None]:
val_iter.batch_size = 64
total_sentences = 800
beam_size = 3
batch_size = val_iter.batch_size
expt_name = "LSTM_Attention"
num_words = 3

generate_kaggle_text(val_iter, lstma, EN, batch_size, beam_size, total_sentences, num_words=num_words,
                     expt_name=expt_name, debug=True, print_on_screen=True)

# Set print_beam_row to -1 to disable printing, or to the number of the batch row whose beam
# you want to see evolve during the search. This works only in IPython for now: to make it work,
# uncomment the code in the above cell and use that class.

In [12]:
val_iter.batch_size = 64
for batch in val_iter:
    break
pred = lstma.translate(batch.src.transpose(0,1).cuda())

### Test beam prediction

Remember that the beam is indexed as batch size × beam size, so `pred_beam[a][b]` gives the `b`-th beam hypothesis for the `a`-th element in the batch. `b = 0` is the most likely option.

In [15]:
batch_elem = 8
print("beam's best prediction (change 0 to another index for other predictions) : ")
print([EN.vocab.itos[int(x)] for x in pred_beam[batch_elem][0]])
print("Greedy prediction : ")
print([EN.vocab.itos[int(x)] for x in pred[batch_elem].data.cpu().numpy()])
print("actual text")
print([EN.vocab.itos[x] for x in batch.trg.transpose(0,1)[batch_elem].data.numpy()])


beam's best prediction (change 0 to another index for other predictions) : 
['<s>', 'The', 'book', 'was', 'published', '.', '</s>', '.', '</s>']
Greedy prediction : 
['<s>', 'The', 'book', 'was', 'published', 'in', '2009']
actual text
['<s>', 'The', 'book', 'was', 'published', 'in', '2009', '.', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
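
In the output above, the beam hypothesis keeps growing after its first `</s>`. One common remedy is to set finished hypotheses aside instead of extending them. The following is a minimal single-sentence sketch of that idea in plain Python, not the batched `translate_beam` used in this notebook: `beam_search`, `step_fn` and `toy_model` are hypothetical names, the probabilities are invented for the demo, and no length normalization is applied.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    """Toy beam search over a single sentence.

    step_fn(prefix) must return a dict {token: log_prob} for the next token.
    Hypotheses that end in `eos` are set aside and never extended.
    """
    beam = [(0.0, [bos])]  # (cumulative log-prob, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for token, log_prob in step_fn(prefix).items():
                candidates.append((score + log_prob, prefix + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            if cand[1][-1] == eos:
                finished.append(cand)  # complete: stop extending it
            else:
                beam.append(cand)
        if not beam:
            break
    finished.extend(beam)  # hypotheses cut off at max_len, if any
    return sorted(finished, key=lambda c: c[0], reverse=True)

def toy_model(prefix):
    """Hypothetical next-token distribution that looks only at the last token."""
    table = {
        '<s>': {'The': 0.6, 'A': 0.4},
        'The': {'book': 0.9, '</s>': 0.1},
        'A': {'book': 0.7, '</s>': 0.3},
        'book': {'</s>': 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix[-1]].items()}

results = beam_search(toy_model, '<s>', '</s>', beam_size=2)
for score, tokens in results:
    print(round(score, 3), tokens)
```

Every returned hypothesis contains at most one `</s>`, and the list is sorted so that index 0 is the highest-scoring one, matching the `b = 0` convention used for `pred_beam`.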


In [None]:
with open('source_test.txt') as f:  # close the file once it has been counted
    num_lines = sum(1 for _ in f)
print("Num lines in text file : ", num_lines)
print("Test dataset size : ", len(test))

In [None]:
BATCH_SIZE = 64
train_iter, val_iter, test_iter = data.BucketIterator.splits((train, val,test), batch_size=BATCH_SIZE, device=-1, shuffle = False, repeat=False)
    

In [None]:
for batch in test_iter:
    # print the first source sentence of the first test batch
    print([DE.vocab.itos[x] for x in batch.src.data.transpose(1,0).numpy()[0]])
    break


    