Tasks:
#### 1. Data preparation
- Download morphological segmentation data from http://turing.iimas.unam.mx/wix/static/resources/language_data.tar.bz2. You can choose which language you want to work with, or even try combining them.


- Alternatively, if you prefer to do a machine translation task, download one of the datasets from https://www.manythings.org/anki/


- Create three .tsv files, one for each of the train/dev/test partitions. If you use the MT data, you will need to decide how to split it; something like 70%/15%/15% should work. Once your data is in this format, you simply need to update the code in the "Load Data" section to load it.
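
The split step can be sketched as follows. This is a minimal example, assuming your examples are already in memory as (source, target) string pairs. The helper names `split_pairs` and `write_tsv`, the fixed seed, and the default fractions are illustrative and not part of the notebook. Because the notebook's `char_tokenize` splits on whitespace, each field is assumed to already contain space-separated symbols.

```python
import random


def split_pairs(pairs, train_frac=0.70, dev_frac=0.15, seed=0):
    """Shuffle (source, target) pairs and cut them into train/dev/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # everything left over (about 15%)
    return train, dev, test


def write_tsv(pairs, path):
    """Write one "source<TAB>target" line per pair (fields must not contain tabs)."""
    with open(path, "w", encoding="utf-8") as f:
        for src, tgt in pairs:
            f.write(f"{src}\t{tgt}\n")
```

A typical use would be `train, dev, test = split_pairs(pairs)` followed by one `write_tsv` call per partition, with file names matching whatever you pass to `TabularDataset.splits` in the "Load Data" section.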


#### 2. Compare RNN Decoder (`Decoder` in the code) vs. RNN Decoder with Attention (`AttentionDecoder` in the code) 
- Train a model (for roughly 50-100 epochs, or more if time permits) using the "Vanilla Decoder", which is the default.


- Make the necessary changes to the code (there should only be two; the places are marked with a "TODO" comment) in order to run the same experiment with the `AttentionDecoder`.


- Compare the results. Do you notice any difference? Are your results similar to those reported in the paper? Why or why not?


#### 3. (Bonus 1) Implement Teacher Forcing
- Currently, in the `Seq2Seq` class's `forward()` method, there is a parameter called `teacher_forcing_ratio`, but we don't use it. "Teacher forcing" is a technique for training seq2seq models where, at each timestep, you give the decoder the correct output from the previous time step with some probability (instead of always feeding it the prediction from the previous time step, which could be wrong). Implement teacher forcing in this method. Assume `teacher_forcing_ratio` is a float between 0 and 1, and indicates the proportion of time we give the correct input to the decoder.
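
The per-step decision described above can be sketched in plain Python, independent of the model. This is only an illustration of the sampling logic: `choose_next_input` is a hypothetical helper, and the string tokens stand in for the tensors used in `forward()`.

```python
import random


def choose_next_input(gold_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the token to feed the decoder at the next time step.

    With probability `teacher_forcing_ratio` the gold token is used;
    otherwise the model's own prediction is fed back in.
    """
    use_gold = rng.random() < teacher_forcing_ratio
    return gold_token if use_gold else predicted_token


# Rough check of the proportions over many draws.
rng = random.Random(0)
draws = [choose_next_input("gold", "pred", 0.5, rng) for _ in range(10_000)]
print(draws.count("gold") / len(draws))  # close to 0.5
```

Since `random.random()` returns values in [0, 1), a ratio of 1.0 always feeds the gold token and a ratio of 0.0 always feeds the prediction. In the notebook this choice would be made once per time step for the whole batch.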

#### 4. (Bonus 2) Compare with RNN Transducer
- Train an RNN Transducer (from a few weeks ago) on the same data and compare the performances. 


In [15]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

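# Note: Field, TabularDataset and BucketIterator belong to torchtext's legacy API.
# In torchtext 0.9-0.11 they live in torchtext.legacy.data; later releases removed them.
from torchtext.data import Field, TabularDataset, BucketIterator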

import numpy as np

import random
import math
import time

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Load Data
We will use utilities from the PyTorch package "torchtext" to load the data and batch it using buckets according to length (in order to minimize padding).

In [166]:
char_tokenize = lambda s: s.split()
SRC = Field(tokenize=char_tokenize, init_token='<sow>', eos_token='<eow>', lower=True)
TGT = Field(tokenize=char_tokenize, init_token='<sow>', eos_token='<eow>', lower=True)

In [174]:
#
# TODO: Update this cell to load the dataset you chose.
#
path_to_data = "../low-resource-polysynthetic-morphology/"

train_data, val_data, test_data = TabularDataset.splits(
        path=path_to_data, train='nahuatl_train.tsv',
        validation='nahuatl_dev.tsv', test='nahuatl_test.tsv', format='tsv',
        fields=[('src', SRC), ('tgt', TGT)])


SRC.build_vocab(train_data)
TGT.build_vocab(train_data)

In [175]:
#
# TODO: play with the batch size. Depending on your machine and dataset you may be able to get 
# away with much larger batches.
#
BATCH_SIZE = 8 

(train_iterator, valid_iterator, test_iterator) = BucketIterator.splits(
    (train_data, val_data, test_data),
    batch_size=BATCH_SIZE,
    device=device,
    sort_key=lambda x: len(x.src) # batch by length in order to minimize sequence padding
)
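
To see why batching by length reduces padding, here is a small stand-alone sketch. It is not how `BucketIterator` is implemented; `count_padding` is a hypothetical helper and the lengths are made up. It only counts how many pad tokens are needed when each batch is padded to its longest sequence.

```python
def count_padding(lengths, batch_size):
    """Total pad tokens needed when consecutive items are batched together
    and every sequence is padded to the longest one in its batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


lengths = [3, 12, 4, 11, 5, 10, 2, 13]  # made-up sequence lengths
print(count_padding(lengths, batch_size=4))          # arbitrary order: 40
print(count_padding(sorted(lengths), batch_size=4))  # grouped by length: 12
```

Grouping similar lengths cuts the padding from 40 to 12 tokens in this toy case, which is the effect the `sort_key` above is meant to achieve.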

## Define Model (Encoder, Decoder, Attention Layer, and Decoder with Attention)
We define both a "standard" decoder and an attention decoder, so that we can evaluate the impact of attention

### Encoder

In [167]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):        
        embedded = self.dropout(self.embedding(src))     
        outputs, hidden = self.rnn(embedded)
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        return outputs, hidden

### Vanilla Decoder (no attention mechanism)

In [168]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, dec_hid_dim, dropout):
        super().__init__()
        self.output_dim = vocab_size
        self.hid_dim = dec_hid_dim
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, dec_hid_dim, dropout=dropout)
        self.fc_out = nn.Linear(dec_hid_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden):
        #
        # On the first time step, the hidden tensor 
        # (the context vector from the encoder) is only 2d, 
        # so we unsqueeze it.
        #
        if len(hidden.shape) == 2:
            hidden = hidden.unsqueeze(0)
            
        input = input.unsqueeze(0)        
        embedded = self.dropout(self.embedding(input))                
        outputs, hidden = self.rnn(embedded, hidden)
        prediction = self.fc_out(outputs.squeeze(0))        
        return prediction, hidden



### Attention

In [169]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #
        # Repeat decoder hidden state src_len times in order to concatenate it 
        # with the encoder outputs.
        #
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        # energy shape: [batch size, src len, dec hid dim]
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))
        # attention shape: [batch size, src len]
        attention = self.v(energy).squeeze(2)
        
        return F.softmax(attention, dim=1)

### Decoder with Attention

In [170]:
class AttentionDecoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        
        a = self.attention(hidden, encoder_outputs) 
        a = a.unsqueeze(1)
                
        #
        # Get weighted sum of encoder states (weighted by attention vector)
        #
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)
                
        rnn_input = torch.cat((embedded, weighted), dim = 2)            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #
        # Also feed the input embedding and the attended encoder representation 
        # to the fully connected output layer.
        #
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
                
        return prediction, hidden.squeeze(0)
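
In equations, the `Attention` and `AttentionDecoder` classes above compute roughly the following at decoding step $t$. The notation is introduced here for illustration and does not appear in the notebook: $h_1, \dots, h_T$ are the encoder outputs, $s_{t-1}$ is the previous decoder hidden state, $E(y_{t-1})$ is the embedding of the previous target token, $W_a, b_a$ are the parameters of `self.attn`, $v$ is the weight of `self.v`, and $W_o, b_o$ are the parameters of `self.fc_out`. Dropout is omitted.

$$
\begin{aligned}
e_{t,i} &= v^\top \tanh\!\left(W_a\,[s_{t-1};\,h_i] + b_a\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{T}\exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{T} \alpha_{t,i}\,h_i \\
s_t &= \mathrm{GRU}\!\left([E(y_{t-1});\,c_t],\; s_{t-1}\right) \\
o_t &= W_o\,[s_t;\,c_t;\,E(y_{t-1})] + b_o
\end{aligned}
$$

Here $\alpha_{t,\cdot}$ corresponds to the tensor returned by `Attention.forward()`, $c_t$ to `weighted`, and $o_t$ to `prediction` (the unnormalized scores over the target vocabulary).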

## Putting it all together (the Seq2Seq model)

In [186]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sow> tokens
        input = trg[0,:]

        for t in range(1, trg_len):
            
            #
            # TODO: change to accommodate the AttentionDecoder forward() call.
            #
            output, hidden = self.decoder(input, hidden, encoder_outputs)  # <- add enc outputs for attention dec
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            input = trg[t]
            #
            # TODO: (Bonus task) implement teacher forcing here
            #
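            # Answer (replaces the `input = trg[t]` line above):
            # teacher_force = random.random() < teacher_forcing_ratio
            # input = trg[t] if teacher_force else top1
            #
            # With teacher_forcing_ratio = 0 this always feeds back top1,
            # which is what evaluate() expects when it turns teacher forcing off.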

        return outputs

## Training Logic

In [187]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TGT.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)

# 
# TODO:
# The following line defines the decoder as a Vanilla RNN (GRU) Decoder (i.e. no attention). 
# Your task is to update this line to use the Bahdanau decoder (AttentionDecoder). You will
# need to check out the __init__ method of AttentionDecoder to make sure you are passing it the
# appropriate args.
#
# dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, DEC_HID_DIM, DEC_DROPOUT)
# ANSWER:
dec = AttentionDecoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)


model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(36, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): AttentionDecoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(37, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=37, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [188]:
optimizer = optim.Adam(model.parameters())
TRG_PAD_IDX = TGT.vocab.stoi[TGT.pad_token]  # the target field is named TGT in the "Load Data" section

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    print("Starting training...")
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.tgt
        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)        
        
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [189]:
def evaluate(model, iterator, criterion):
    model.eval()

    epoch_loss = 0    
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.tgt
            output = model(src, trg, 0) # turn off teacher forcing
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [190]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [191]:
N_EPOCHS = 50
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Starting training...
Epoch: 01 | Time: 0m 27s
	Train Loss: 2.802 | Train PPL:  16.474
	 Val. Loss: 2.546 |  Val. PPL:  12.761
Starting training...
Epoch: 02 | Time: 0m 33s
	Train Loss: 2.279 | Train PPL:   9.769
	 Val. Loss: 2.186 |  Val. PPL:   8.901
Starting training...
Epoch: 03 | Time: 0m 32s
	Train Loss: 1.952 | Train PPL:   7.042
	 Val. Loss: 1.964 |  Val. PPL:   7.126
Starting training...
Epoch: 04 | Time: 0m 33s
	Train Loss: 1.553 | Train PPL:   4.725
	 Val. Loss: 1.392 |  Val. PPL:   4.022
Starting training...
Epoch: 05 | Time: 0m 32s
	Train Loss: 1.123 | Train PPL:   3.075
	 Val. Loss: 1.103 |  Val. PPL:   3.013
Starting training...
Epoch: 06 | Time: 0m 33s
	Train Loss: 0.850 | Train PPL:   2.340
	 Val. Loss: 0.843 |  Val. PPL:   2.322
Starting training...
Epoch: 07 | Time: 0m 33s
	Train Loss: 0.695 | Train PPL:   2.004
	 Val. Loss: 0.593 |  Val. PPL:   1.810
Starting training...
Epoch: 08 | Time: 0m 32s
	Train Loss: 0.482 | Train PPL:   1.619
	 Val. Loss: 0.462 |  Val. PPL: 

KeyboardInterrupt: 
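
The PPL columns in the log above are the exponential of the corresponding loss, as computed by `math.exp(...)` in the training loop. A minimal sketch of that relationship follows, assuming the loss is an average per-token cross-entropy in nats (`nn.CrossEntropyLoss` uses the natural logarithm). The `perplexity` helper is hypothetical.

```python
import math


def perplexity(avg_cross_entropy):
    """Perplexity for an average per-token cross-entropy given in nats."""
    return math.exp(avg_cross_entropy)


# A model that guesses uniformly over a 37-symbol vocabulary has loss ln(37),
# hence perplexity of about 37; a perfect model has loss 0 and perplexity 1.
print(perplexity(math.log(37)))
print(perplexity(0.0))
```

This gives a rough scale for reading the log: with the 37-symbol target vocabulary shown in the model summary, a perplexity near 37 means the model is no better than uniform guessing, and values approaching 1 mean it is nearly certain of each next symbol.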

In [180]:
model.decoder

Decoder(
  (embedding): Embedding(37, 256)
  (rnn): GRU(256, 512, dropout=0.5)
  (fc_out): Linear(in_features=512, out_features=37, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)