Tasks:
#### 1. Data preparation

- Download a translation dataset (pick a language pair) from https://www.manythings.org/anki/



- Alternatively, download morphological segmentation data from http://turing.iimas.unam.mx/wix/static/resources/language_data.tar.bz2. You can choose which language to work with, or try combining several. This dataset will likely train somewhat faster than the MT data above. One trick that has been shown to work on this type of data is to add random strings that map to themselves, which teaches the decoder to output mostly the same characters it sees in the input (plus the morpheme boundary characters).



- Create three .tsv files, one for each of the train/dev/test partitions (if you use the MT data you will need to choose how to split it; something like 70%/15%/15% would probably work). Once you have the data in this format, you simply need to update the code in the "Load Data" section to load your data.


- For the NMT dataset, you may need to update the tokenization function depending on your language(s).
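
The data preparation steps above can be sketched roughly as follows. This is only a minimal illustration using the standard library: the helper names (`identity_pairs`, `split_pairs`, `write_tsv`), the output file names, and the 70/15/15 defaults are assumptions made for this example, not part of the notebook. It assumes each example is a `(source, target)` pair of space-separated characters, as in the Wixarika files loaded later.

```python
import random


def identity_pairs(n, alphabet, min_len=3, max_len=10, seed=0):
    """Generate n random strings that map to themselves (hypothetical helper)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        chars = " ".join(rng.choice(alphabet) for _ in range(length))
        pairs.append((chars, chars))  # source and target are identical
    return pairs


def split_pairs(pairs, train_frac=0.7, dev_frac=0.15, seed=0):
    """Shuffle (source, target) pairs and split them into train/dev/test lists."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_dev = int(len(pairs) * dev_frac)
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:]  # whatever remains goes to test
    return train, dev, test


def write_tsv(pairs, path):
    """Write one tab-separated source/target pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            f.write(f"{src}\t{tgt}\n")


if __name__ == "__main__":
    # Toy data standing in for a real corpus.
    data = [("p a p a t +", "p a ! p a ! t +"), ("k u r a r u", "k u r a r u")] * 50
    train, dev, test = split_pairs(data)
    # Add the self-mapping strings to the training partition only.
    train += identity_pairs(20, alphabet="aeikmnprtuwxy+'")
    write_tsv(train, "train.tsv")
    write_tsv(dev, "dev.tsv")
    write_tsv(test, "test.tsv")
```

Keeping the random identity strings out of dev and test means the reported scores still reflect real words only.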


#### 2. Compare RNN Decoder (`Decoder` in the code) vs. RNN Decoder with Attention (`AttentionDecoder` in the code) 
- Read through the code for the Encoder, Attention, and the two Decoder classes. Make sure you have some understanding of what is going on before proceeding.

- Train the model for roughly 50-100 epochs (more if time permits) using the "Vanilla Decoder", which is the default.


- Make the necessary changes to the code in order to run the same experiment with the `AttentionDecoder`. There should only be 2 changes; their places are marked with a "TODO" comment.


- Compare the results. Do you notice any difference? For now, just look at the validation loss.

- Add to the `evaluate` function so that you also report a metric (you choose what metric).
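
As one possible choice of metric, the sketch below computes exact-match accuracy and mean character edit distance over decoded token lists. It is an illustration under assumptions: the function names are made up for this example, and it expects predictions and targets that have already been mapped from indices back to tokens (for instance via `TGT.vocab.itos`), with `<eow>` as the end marker.

```python
def edit_distance(pred, gold):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(gold) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, g in enumerate(gold, start=1):
            cost = 0 if p == g else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert g
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def strip_special(tokens, eos="<eow>"):
    """Keep the tokens before the first end-of-word marker."""
    out = []
    for tok in tokens:
        if tok == eos:
            break
        out.append(tok)
    return out


def corpus_metrics(preds, golds):
    """Return (exact-match accuracy, mean edit distance) over paired sequences."""
    preds = [strip_special(p) for p in preds]
    golds = [strip_special(g) for g in golds]
    exact = sum(p == g for p, g in zip(preds, golds))
    dist = sum(edit_distance(p, g) for p, g in zip(preds, golds))
    return exact / len(golds), dist / len(golds)
```

Edit distance gives partial credit for near misses, such as a single misplaced boundary character, which exact match scores as a complete failure.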


#### 3. (Bonus 1) Implement Teacher Forcing
- Currently, in the `Seq2Seq` class's `forward()` method, there is a parameter called `teacher_forcing_ratio`, but we don't use it. "Teacher forcing" is a technique for training seq2seq models where, at each timestep, you give the decoder the correct output from the previous time step with some probability (instead of always feeding it the prediction from the previous time step, which could be wrong). Implement teacher forcing in this method. Assume `teacher_forcing_ratio` is a float between 0 and 1, and indicates the proportion of time we give the correct input to the decoder.
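
The sampling decision behind teacher forcing can be illustrated without any model. The sketch below is a toy, framework-free version: `next_decoder_input` and `decoder_inputs` are hypothetical names, and plain lists of tokens stand in for the tensors used in `forward()`.

```python
import random


def next_decoder_input(gold_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick the gold token with probability teacher_forcing_ratio, else the prediction."""
    use_gold = rng.random() < teacher_forcing_ratio
    return gold_token if use_gold else predicted_token


def decoder_inputs(gold, predicted, teacher_forcing_ratio, seed=0):
    """Tokens that would be fed to the decoder at each step after the first."""
    rng = random.Random(seed)
    return [next_decoder_input(g, p, teacher_forcing_ratio, rng)
            for g, p in zip(gold, predicted)]


gold = ["p", "a", "!", "k", "a"]
predicted = ["p", "a", "a", "k", "a"]
print(decoder_inputs(gold, predicted, 1.0))  # always the gold tokens
print(decoder_inputs(gold, predicted, 0.0))  # always the model's own predictions
```

Because `random()` returns a value in `[0, 1)`, a ratio of 1 always selects the gold token and a ratio of 0 never does, so no special case is needed for either end of the range. In a batched model the same draw is typically made once per time step for the whole batch.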

#### 4. (Bonus 2: probably more relevant for the morphological segmentation corpus) Compare with RNN Transducer
- Train an RNN Transducer (from a few weeks ago) on the same data and compare performance.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

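# Note: Field, TabularDataset and BucketIterator were moved to torchtext.legacy in
# torchtext 0.9 and removed in 0.12, so this import needs an older torchtext release.
from torchtext.data import Field, TabularDataset, BucketIterator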

import numpy as np

import random
import math
import time

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Load Data
We will use utilities from the PyTorch package "torchtext" to load the data and batch it in buckets of similar length (in order to minimize padding).

In [104]:
char_tokenize = lambda s: s.split()
SRC = Field(tokenize=char_tokenize, init_token='<sow>', eos_token='<eow>', lower=True)
TGT = Field(tokenize=char_tokenize, init_token='<sow>', eos_token='<eow>', lower=True)



In [105]:
#
# TODO: Update this cell to load the dataset you chose. Once you have your data in 3 tsv 
# files (one per train/dev/test), just update the path and the names of the files.
#
path_to_data = "./"

train_data, val_data, test_data = TabularDataset.splits(
        path=path_to_data, train='train_w_words_wixarika.tsv',
        validation='wixarika_dev.tsv', test='wixarika_test.tsv', format='tsv',
        fields=[('src', SRC), ('tgt', TGT)])


SRC.build_vocab(train_data)
TGT.build_vocab(train_data)



In [106]:
#
# TODO: play with the batch size. Depending on your machine and dataset you may be able to get 
# away with much larger batches.
#
BATCH_SIZE = 24


(train_iterator, valid_iterator, test_iterator) = BucketIterator.splits(
    (train_data, val_data, test_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort_key=lambda x: len(x.src) # batch by length in order to minimize sequence padding
)
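
To see why batching by length reduces padding, here is a simplified, standard-library illustration of the idea. It is not how torchtext implements `BucketIterator` (which also shuffles); the function names and the toy lengths are assumptions made for this example.

```python
def bucket_batches(lengths, batch_size):
    """Group example indices so each batch holds examples of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_cost(lengths, batches):
    """Count pad tokens needed when every batch is padded to its longest example."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


lengths = [2, 9, 3, 8, 2, 9]            # toy sequence lengths
in_order = [[0, 1], [2, 3], [4, 5]]     # batches taken in file order
bucketed = bucket_batches(lengths, 2)   # batches grouped by length
print(padding_cost(lengths, in_order), padding_cost(lengths, bucketed))
```

On this toy input the in-order batches need 19 pad tokens while the length-sorted batches need 5.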



## Define Model (Encoder, Decoder, Attention Layer, and Decoder with Attention)
We define both a "standard" decoder and an attention decoder, so that we can evaluate the impact of attention

### Encoder

In [107]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):        
        embedded = self.dropout(self.embedding(src))     
        outputs, hidden = self.rnn(embedded)
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        return outputs, hidden

### Vanilla Decoder (no attention mechanism)

In [108]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, dec_hid_dim, dropout):
        super().__init__()
        self.output_dim = vocab_size
        self.hid_dim = dec_hid_dim
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear(dec_hid_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden):
        #
        # On the first time step, the hidden tensor 
        # (the context vector from the encoder) is only 2d, 
        # so we unsqueeze it.
        #
        if len(hidden.shape) == 2:
            hidden = hidden.unsqueeze(0)
            
        input = input.unsqueeze(0)        
        embedded = self.dropout(self.embedding(input))                
        outputs, hidden = self.rnn(embedded, hidden)
        prediction = self.fc_out(outputs.squeeze(0))        
        return prediction, hidden



### Attention

In [109]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #
        # Repeat decoder hidden state src_len times in order to concatenate it 
        # with the encoder outputs.
        #
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        # energy shape: [batch size, src len, dec hid dim]
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))
        # attention shape: [batch size, src len]
        attention = self.v(energy).squeeze(2)
        
        return F.softmax(attention, dim=1)
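
The last two steps of attention, normalising the scores with a softmax and then using the weights to average the encoder states, can be sketched in plain Python for a single example. This is a toy illustration with made-up numbers and hypothetical function names; in the notebook the same operations are carried out on batched tensors with `F.softmax` and `torch.bmm`.

```python
import math


def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, encoder_states):
    """Weighted sum of encoder states, one weight per source position."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states))
            for d in range(dim)]


scores = [0.0, 0.0]                         # equal scores for two source positions
encoder_states = [[1.0, 0.0], [3.0, 2.0]]   # one 2-d state per position
weights = softmax(scores)
print(weights, context_vector(weights, encoder_states))
```

With equal scores the weights are uniform, so the context vector is just the mean of the encoder states; a larger score for one position pulls the context vector toward that position's state.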

### Decoder with Attention

In [110]:
class AttentionDecoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        
        a = self.attention(hidden, encoder_outputs) 
        a = a.unsqueeze(1)
                
        #
        # Get weighted sum of encoder states (weighted by attention vector)
        #
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
                
        rnn_input = torch.cat((embedded, weighted), dim = 2)            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #
        # Also feed the input embedding and the attended encoder representation 
        # to the fully connected output layer.
        #
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
                
        return prediction, hidden.squeeze(0)

## Putting it all together (the Seq2Seq model)

In [111]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sow> tokens
        input = trg[0,:]

        
        for t in range(1, trg_len):
            
            #
            # TODO: change to accommodate the AttentionDecoder forward() call.
            #
            # answer: output, hidden = self.decoder(input, hidden, encoder_outputs) 
            output, hidden = self.decoder(input, hidden)  

            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            input = trg[t]
            #
            # TODO: (Bonus task) implement teacher forcing here
            #
            # Answer (a ratio of 0 always feeds the model's own prediction):
            # teacher_force = random.random() < teacher_forcing_ratio
            # input = trg[t] if teacher_force else top1

        return outputs

## Training Logic

In [132]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TGT.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)

# 
# TODO:
# The following line defines the decoder as a Vanilla RNN (GRU) Decoder (i.e. no attention). 
# Your task is to update this line to use the Bahdanau decoder (AttentionDecoder). You will
# need to check out the __init__ method of AttentionDecoder to make sure you are passing it the
# appropriate args.
#
# ANSWER: dec = AttentionDecoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, DEC_HID_DIM, DEC_DROPOUT)


model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(30, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(31, 256)
    (rnn): GRU(256, 512)
    (fc_out): Linear(in_features=512, out_features=31, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [133]:


def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    print("Starting training...")
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.tgt
        optimizer.zero_grad()
        output = model(src, trg, 0.5)  # use teacher forcing during training only.
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)        
        
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [134]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0    
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.tgt
            output = model(src, trg, 0) # turn off teacher forcing   
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
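
The `criterion` passed to `train` and `evaluate` is created further down as `nn.CrossEntropyLoss` with `ignore_index` set to the padding index, so padded positions do not contribute to the loss. A rough pure-Python sketch of that masking is shown below. It is a simplification for illustration only: it takes log-probabilities rather than the raw scores that `nn.CrossEntropyLoss` expects, and the function name is made up.

```python
import math


def masked_cross_entropy(log_probs, targets, pad_idx):
    """Mean negative log-likelihood over the positions whose target is not padding."""
    losses = [-lp[t] for lp, t in zip(log_probs, targets) if t != pad_idx]
    return sum(losses) / len(losses)


PAD = 0  # assumed padding index for this toy example
log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.75)],
    [math.log(0.9), math.log(0.1)],   # padded position, ignored below
]
targets = [1, 1, PAD]
loss = masked_cross_entropy(log_probs, targets, PAD)
print(loss, math.exp(loss))  # mean loss and the corresponding perplexity
```

Exponentiating the mean loss gives a per-token perplexity, which can be easier to compare across runs than the raw loss.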

In [135]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 150
CLIP = 1
optimizer = optim.Adam(model.parameters(), lr=0.01)
TRG_PAD_IDX = TGT.vocab.stoi[TGT.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

train_losses, val_losses = [], []

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    train_losses.append(train_loss)
    val_losses.append(valid_loss)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')
    

Starting training...
Epoch: 01 | Time: 0m 19s
	Train Loss: 2.227
	 Val. Loss: 1.371
Starting training...
Epoch: 02 | Time: 0m 20s
	Train Loss: 1.736
	 Val. Loss: 1.190
Starting training...
Epoch: 03 | Time: 0m 21s
	Train Loss: 1.630
	 Val. Loss: 1.290
Starting training...
Epoch: 04 | Time: 0m 21s
	Train Loss: 1.661
	 Val. Loss: 1.251
Starting training...
Epoch: 05 | Time: 0m 18s
	Train Loss: 1.690
	 Val. Loss: 1.223
Starting training...
Epoch: 06 | Time: 0m 19s
	Train Loss: 1.522
	 Val. Loss: 1.245
Starting training...
Epoch: 07 | Time: 0m 19s
	Train Loss: 1.454
	 Val. Loss: 1.054
Starting training...
Epoch: 08 | Time: 0m 19s
	Train Loss: 1.262
	 Val. Loss: 1.081
Starting training...
Epoch: 09 | Time: 0m 19s
	Train Loss: 1.455
	 Val. Loss: 1.512
Starting training...
Epoch: 10 | Time: 0m 19s
	Train Loss: 1.888
	 Val. Loss: 1.406
Starting training...
Epoch: 11 | Time: 0m 18s
	Train Loss: 1.971
	 Val. Loss: 1.541
Starting training...
Epoch: 12 | Time: 0m 18s
	Train Loss: 2.180
	 Val. Loss

In [81]:
! head wixarika_dev.tsv


k e n e u p i t + a	k e ! n ! e u ! p i ! t + a
p a p a t +	p a ! p a ! t +
m a y e m a	m ! a ! y e ! m a
p e m +	p ! e ! m +
w i k + a y a t +	w i k + a y a ! t +
m a t s i ' u t a x a n e t a x +	m a ! t s i ' ! u ! t a ! x a n e t a ! x +
n e p i t i t u a n i	n e ! p ! i ! t i ! t u a ! n i
n e x a w e r i	n e ! x a w e r i
p u k a w e	p ! u ! k a ! w e
k u r a r u	k u r a r u


In [137]:
with torch.no_grad():
    # Run the model on a single test batch so its predictions can be inspected below.
    for i, batch in enumerate(test_iterator):
        src = batch.src
        trg = batch.tgt
        output = model(src, trg, 0) # turn off teacher forcing  
        break
        
        

In [143]:
def ismatch(outputseq, targetseq):
    # Compare token by token up to and including the first end-of-word marker.
    for i, char in enumerate(outputseq):
        if char != targetseq[i]:
            return False
        if char == "<eow>":
            return True
    # No end-of-word marker was produced, so the prediction is not a complete match.
    return False
        
def accuracy(iterator):
    correct, total = 0, 0

    model.eval()  # disable dropout while scoring
    with torch.no_grad():
        for batch in iterator:
            src = batch.src
            trg = batch.tgt
            output = model(src, trg, 0) # turn off teacher forcing 
            predseqs, targetseqs = [], []
            for pred in output[1:].argmax(2).transpose(1, 0):
                predseqs.append([TGT.vocab.itos[idx.item()] for idx in pred])
            for tgt in trg[1:].transpose(1, 0):
                targetseqs.append([TGT.vocab.itos[idx.item()] for idx in tgt])

            for predseq, targetseq in zip(predseqs, targetseqs):
                if ismatch(predseq, targetseq):
                    correct += 1
                total += 1
    return correct / total
                    
accuracy(test_iterator)

0.21338155515370705

In [139]:
for pred in output[1:].argmax(2).transpose(1, 0):
    print([TGT.vocab.itos[i.item()] for i in pred])

['m', '!', 'm', 'e', '<eow>', '<eow>', '<eow>']
['x', '+', 'k', 'e', '<eow>', '<eow>', '<eow>']
["'", 'u', 't', 't', 'a', '<eow>', '<eow>']
['p', 'a', 'a', '!', 'k', 'a', '<eow>']
['k', 'i', 'y', 'y', 'a', '<eow>', '<eow>']
['p', '!', '!', 'k', 'a', '<eow>', '<eow>']
['n', 'a', 'w', 'a', '<eow>', '<eow>', '<eow>']
['w', 'i', 'k', 'k', 'i', '<eow>', '<eow>']
['m', 'a', 'n', 'a', '<eow>', '<eow>', '<eow>']
['m', '+', 'y', 'y', '+', '<eow>', '<eow>']
['m', '+', '<eow>', '<eow>', '<eow>', '<eow>', '<eow>']
['u', 'w', 'a', '<eow>', '<eow>', '<eow>', '<eow>']
['e', 'n', 'a', '<eow>', '<eow>', '<eow>', '<eow>']
['i', 'k', 'i', '<eow>', '<eow>', '<eow>', '<eow>']
['u', 'm', 'a', '<eow>', '<eow>', '<eow>', '<eow>']
['i', 'y', 'a', '<eow>', '<eow>', '<eow>', '<eow>']
['u', 't', 't', 'a', '<eow>', '<eow>', '<eow>']
['u', 'k', 'i', '<eow>', '<eow>', '<eow>', '<eow>']
['w', 'a', 'i', '<eow>', '<eow>', '<eow>', '<eow>']
['x', 'e', 'i', '<eow>', '<eow>', '<eow>', '<eow>']
['h', '+', '<eow>', '<eow>',

In [140]:
for pred in trg[1:].transpose(1, 0):
    print([TGT.vocab.itos[i.item()] for i in pred])

['m', 'i', 'm', 'e', '<eow>', '<pad>', '<pad>']
['x', '+', 'y', 'e', '<eow>', '<pad>', '<pad>']
["'", 'u', '!', 't', 'a', '<eow>', '<pad>']
['p', '!', 'a', '!', 'k', 'a', '<eow>']
['k', 'i', '!', 'y', 'a', '<eow>', '<pad>']
['p', 'e', '!', 'k', 'a', '<eow>', '<pad>']
['n', 'a', 'w', 'i', '<eow>', '<pad>', '<pad>']
['w', 'i', '!', 'k', 'i', '<eow>', '<pad>']
['m', 'a', 'n', 'a', '<eow>', '<pad>', '<pad>']
['m', '+', '!', 'y', 'e', '<eow>', '<pad>']
['m', '+', 'k', '<eow>', '<pad>', '<pad>', '<pad>']
['u', 'w', 'a', '<eow>', '<pad>', '<pad>', '<pad>']
['e', 'n', 'a', '<eow>', '<pad>', '<pad>', '<pad>']
['i', 'k', 'i', '<eow>', '<pad>', '<pad>', '<pad>']
['u', 'm', 'a', '<eow>', '<pad>', '<pad>', '<pad>']
['i', 'y', 'a', '<eow>', '<pad>', '<pad>', '<pad>']
['u', '!', 't', 'a', '<eow>', '<pad>', '<pad>']
['u', 'k', 'i', '<eow>', '<pad>', '<pad>', '<pad>']
['w', 'a', 'i', '<eow>', '<pad>', '<pad>', '<pad>']
['x', 'e', 'i', '<eow>', '<pad>', '<pad>', '<pad>']
['h', '+', '<eow>', '<pad>', '<p

In [144]:
!wc -l *tsv

   864 TESTALLWORDS.tsv
  1664 train_w_words_wixarika.tsv
   167 wixarika_dev.tsv
   553 wixarika_test.tsv
   665 wixarika_train.tsv
  3913 total
