## Tutorial - Seq2Seq model for Neural Machine Translation
This tutorial is adapted from [_NLP From Scratch: Translation with a Sequence to Sequence Network and Attention_](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) tutorial from Pytoch documentations.

In this tutorial you will learn how to one language to another using neural network. Here, we treanslate English to French as example.

```
[KEY: > input, = target, < output]

> il est en train de peindre un tableau .
= he is painting a picture .
< he is painting a picture .

> pourquoi ne pas essayer ce vin delicieux ?
= why not try that delicious wine ?
< why not try that delicious wine ?

> vous etes trop maigre .
= you re too skinny .
< you re all alone .

```

In [7]:
#Generic imports and OS settings
import os
import csv
import glob
import matplotlib
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [8]:
## Requirements
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random
from tqdm import trange
import torch
import nltk 
import numpy as np

#Custom imports and device settings
from numpy import transpose
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate import meteor
from nltk.corpus import stopwords



In [27]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
teacher_forcing_ratio = 1
ITERCOUNT = 60000
MAX_LENGTH = 150 


### Part1 - Seq2Seq model

Transforming one sequence to another is possible by the simple but powerful idea of the [sequence to sequence network](https://arxiv.org/abs/1409.3215), in which two recurrent neural networks work. An encoder network condenses an input sequence into a vector, and a decoder network unfolds that vector into a new sequence.

![seq2seq](https://pytorch.org/tutorials/_images/seq2seq.png)

#### Loading data files

The data used in this tutorial contains thousands of English to French translation pairs.

This question on Open Data Stack Exchange pointed me to the open translation site https://tatoeba.org/ which has downloads available at https://tatoeba.org/eng/downloads - and better yet, someone did the extra work of splitting language pairs into individual text files here: https://www.manythings.org/anki/

The English to French pairs can be found at data/eng-fra.txt. It is a tab separated list of translation pairs.
```
I am cold.    J'ai froid.
```

We will be representing each word in a language as a one-hot vector, or giant vector of zeros except for a single one (at the index of the word). There are many more words, so the encoding vector is much larger. We will however cheat a bit and trim the data to only use a few thousand words per language.

![vocab](https://pytorch.org/tutorials/_images/word-encoding.png)

We’ll need a unique index per word to use as the inputs and targets of the networks later. To keep track of all this we will use a helper class called Lang which has word → index (word2index) and index → word (index2word) dictionaries, as well as a count of each word word2count which will be used to replace rare words later.

In [16]:
SOS_token = 0
EOS_token = 1


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

    def stoi(self, word):
        return self.word2index[word]
    
    def itos(self, ndx):
        return self.index2word[ndx]
    
    def contains(self, word):
        if word in self.word2index:
            return True
        return False

All files are in Unicode, to simplify, we will turn Unicode characters to ASCII, make everything lowercase, and trim most punctuation.

In [17]:
# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

# RECIPE DATA PROCESSING

Pipeline is: 
1. process raw text extracts recipe pairs from the raw text folders by calling text-to-recipe
2. recipes are cleaned up with data normalisation and preprocessing
3. write-tsv writes the tsv objects with the recipe list

these TSV files are our new data

4. extract pairs pulls the data from the tsv files
5. the pairs are passed into build-language as well as into their own object
6. pairs and both languages are returned

In [18]:
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
measurements = {"c", "tsp", "tbsp", "qt", "cn", "lb", "ts", "ea", "lg", "tb"}
stop_words.update(measurements)


def trim_stopwords(text):
    output = " ".join([word for word in str(text).split() if word not in stop_words])
    output = output.lower()
    return output

def recipe_cleanup(all_recipes):
    outputs = []
    for recipe in all_recipes:
        clean_ing = trim_stopwords(recipe[0])
        clean_step = trim_stopwords(recipe[1])
        outputs.append((clean_ing, clean_step))
    return outputs

    #     - lower (10 mins)
    # - dissolve contractions (30 mins)
    # - remove stopwords (20 mins)

# test_recipe = trim_stopwords("Ours is the Question of Glory")
# print(test_recipe)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marks\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
def text_to_recipe_processing(line):
    title = re.findall(r'Title: (.*)', line)
    ingredients = re.findall(r'ingredients: (.*)', line)
    steps = re.findall(r'ingredients: .*\n([\s\S]*)', line)
    try:
        title = title[0]
        title = re.sub(r'[^a-zA-Z ]', '', title) #remove non-alpha or underscore characters
        title = re.sub(r'\s+', ' ', title) #remove excess spaces
        ingredients = ingredients[0].replace('''\t''', " ") #replace tab with space for better formatting
        ingredients = re.sub(r'[^a-zA-Z ]', '', ingredients)
        ingredients = re.sub(r'\s+', ' ', ingredients)
        steps = steps[0].replace('''\n''', " ") #replace newline in steps with space
        steps = re.sub(r'[^a-zA-Z ]', '', steps)
        steps = re.sub(r'\s+', ' ', steps)
    except:
        return None
    return (str(title + " " + ingredients), str(steps))

def process_rawtext(path):
    print("Processing text data from {}".format(path))
    recipes = []
    files = glob.glob(path + "/*.txt")
    for file in files:
        lines = open(file, encoding='utf-8').read().strip().split("END RECIPE")
        for l in lines:
            recipe = text_to_recipe_processing(l)
            if recipe is not None:
                recipes.append(recipe)
    return recipes

def write_to_tsv(destination, recipe_list):
    with open(destination, 'w',  newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter='\t')
        for recipe in recipe_list:
            writer.writerow(recipe)

def build_language(data_paths):
    all_ingredients = []
    all_recipes = []
    for path in data_paths: #manual path list:
        recipes = extract_pairs(path)
        for r in recipes:
            all_ingredients.append(r[0])
            all_recipes.append(r[1])
    ingredient_lang = Lang("ingredients")
    recipe_lang = Lang("recipes")
    for ing in all_ingredients:
        ingredient_lang.addSentence(ing)
    for rec in all_recipes:
        recipe_lang.addSentence(rec)
    return ingredient_lang, recipe_lang

# def extract_pairs(path):
#     all_recipes = []
#     file = open(path, 'r')
#     lines = file.readlines()
#     for l in lines:
#         items = [normalizeString(s) for s in l.split("""\t""")]
#         all_recipes.append(items)
#     all_recipes = filterPairs(all_recipes)
#     return all_recipes


In [7]:
### Recipes with stopwords included
# train_recipes = process_rawtext("Cooking_Dataset/train")
# write_to_tsv("Dataset/train.tsv", train_recipes)
# test_recipes = process_rawtext("Cooking_Dataset/test")
# write_to_tsv("Dataset/test.tsv", test_recipes)
# dev_recipes = process_rawtext("Cooking_Dataset/dev")
# write_to_tsv("Dataset/dev.tsv", dev_recipes)

# ###Recipes without stopwords
# train_recipes = process_rawtext("Cooking_Dataset/train")
# cleaned_train = recipe_cleanup(train_recipes)
# write_to_tsv("Clean_Data/train.tsv", cleaned_train)
# dev_recipes = process_rawtext("Cooking_Dataset/dev")
# cleaned_dev = recipe_cleanup(dev_recipes)
# write_to_tsv("Clean_Data/dev.tsv", cleaned_dev)
# test_recipes = process_rawtext("Cooking_Dataset/test")
# cleaned_test = recipe_cleanup(test_recipes)
# write_to_tsv("Clean_Data/test.tsv", cleaned_test)

Processing text data from Cooking_Dataset/train
Processing text data from Cooking_Dataset/dev
Processing text data from Cooking_Dataset/test


To read the data file we will split the file into lines, and then split lines into pairs. The used files contain English → Other Language, so I added the `reverse` flag to reverse the pairs, in case that you want to translate from Other Language → English .

In [2]:
# def readLangs(lang1, lang2, reverse=False):
#     print("Reading lines...")

#     # Read the file and split into lines
#     file_name = 'data/%s-%s.txt' % (lang1, lang2)
#     lines = open(file_name, encoding='utf-8').\
#         read().strip().split('\n')

#     # Split every line into pairs and normalize
#     pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

#     # Reverse pairs, make Lang instances
#     if reverse:
#         pairs = [list(reversed(p)) for p in pairs]
#         input_lang = Lang(lang2)
#         output_lang = Lang(lang1)
#     else:
#         input_lang = Lang(lang1)
#         output_lang = Lang(lang2)

#     return input_lang, output_lang, pairs


Pairs can be later extracted from the .tsv files and returned as recipes, after being filtered for passing max length and the strings normalised (just in case any stray data makes it through the cleanup phase)

In [20]:


# eng_prefixes = (
#     "i am ", "i m ",
#     "he is", "he s ",
#     "she is", "she s ",
#     "you are", "you re ",
#     "we are", "we re ",
#     "they are", "they re "
# )

def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH

def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

def extract_pairs(path):
    all_recipes = []
    file = open(path, 'r')
    lines = file.readlines()
    for l in lines:
        items = [normalizeString(s) for s in l.split("""\t""")]
        all_recipes.append(items)
    all_recipes = filterPairs(all_recipes)
    return all_recipes

The full process for preparing the data is:

- Read text file and split into lines, split lines into pairs
- Normalize text, filter by length and content
- Make word lists from sentences in pairs


In [22]:
# def prepareData(lang1, lang2, reverse=False):
#     input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
#     print("Read %s sentence pairs" % len(pairs))
#     pairs = filterPairs(pairs)
#     print("Trimmed to %s sentence pairs" % len(pairs))
#     print("Counting words...")
#     for pair in pairs:
#         input_lang.addSentence(pair[0])
#         output_lang.addSentence(pair[1])
#     print("Counted words:")
#     print(input_lang.name, input_lang.n_words)
#     print(output_lang.name, output_lang.n_words)
#     return input_lang, output_lang, pairs


# input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
# print(random.choice(pairs))
dataset_path = ["Clean_Data/train.tsv", "Clean_Data/dev.tsv", "Clean_Data/test.tsv"]
input_lang, output_lang = build_language(dataset_path)
pairs = extract_pairs("Clean_Data/train.tsv")
dev_pairs = extract_pairs("Clean_Data/dev.tsv")
test_pairs = extract_pairs("Clean_Data/test.tsv")

NameError: name 'MAX_LENGTH' is not defined

In [33]:
print(pairs[0])

['marinated chicken mike peters chicken cut up cut each ts garlic powder large piece into two parts ts white pepper c oil ts marjoram c vinegar ts rosemary ts onion powder ts salt', 'salt each piece of chicken combine oil vinegar onion powder garlic powder pepper marjoram rosemary and salt place chicken in marinade and refrigerate over night turn chicken once or twice as it marinates broil chicken on grill or in broiler once or twice during the cooking process dip chicken into the marinade']


#### Data Summary Statistics

In [36]:
def pair_summary_stats(pairs, name):
    ingredients = [pairs[i][0] for i in range(len(pairs))]
    recipes = [pairs[i][1] for i in range(len(pairs))]
    ingredient_stats = [len(r) for r in ingredients]
    recipe_stats = [len(r) for r in recipes]
    ing_stats = summary_statistics(ingredient_stats)
    rec_stats = summary_statistics(recipe_stats)
    print("Maximum length of ingredients from {} set: {}".format(name, ing_stats[0]))
    print("Minimum length of ingredients from {} set: {}".format(name, ing_stats[1]))
    print("Mean length of ingredients from {} set: {}".format(name, ing_stats[2]))
    print("Maximum length of recipes from {} set: {}".format(name, rec_stats[0]))
    print("Minimum length of recipes from {} set: {}".format(name, rec_stats[1]))
    print("Mean length of recipes from {} set: {}".format(name, rec_stats[2]))

def summary_statistics(input_list):
    l_max = max(input_list)
    l_min = max(input_list)
    l_mean = sum(input_list)/len(input_list)
    return (l_max, l_min, l_mean)

In [43]:
print("Length of ingredients corpus: {}".format(input_lang.n_words))
print("Length of recipes corpus: {}".format(output_lang.n_words))
print("######")
pair_summary_stats(pairs, "train")

Length of ingredients corpus: 29726
Length of recipes corpus: 29236
######
Maximum length of ingredients from train set: 1805
Minimum length of ingredients from train set: 1805
Mean length of ingredients from train set: 203.50929768209045
Maximum length of recipes from train set: 3867
Minimum length of recipes from train set: 3867
Mean length of recipes from train set: 482.7535549960934


### MODEL DEFINITION - MODEL ONE
A Recurrent Neural Network, or RNN, is a network that operates on a sequence and uses its own output as input for subsequent steps.

A Sequence to Sequence network, or seq2seq network, or [Encoder Decoder network](https://arxiv.org/pdf/1406.1078v3.pdf), is a model consisting of two RNNs called the encoder and decoder. The encoder reads an input sequence and outputs a single vector, and the decoder reads that vector to produce an output sequence.

![seq2seq](https://pytorch.org/tutorials/_images/seq2seq.png)

Unlike sequence prediction with a single RNN, where every input corresponds to an output, the seq2seq model frees us from sequence length and order, which makes it ideal for translation between two languages.

Consider the sentence “Je ne suis pas le chat noir” → “I am not the black cat”. Most of the words in the input sentence have a direct translation in the output sentence, but are in slightly different orders, e.g. “chat noir” and “black cat”. Because of the “ne/pas” construction there is also one more word in the input sentence. It would be difficult to produce a correct translation directly from the sequence of input words.

With a seq2seq model the encoder creates a single vector which, in the ideal case, encodes the “meaning” of the input sequence into a single vector — a single point in some N dimensional space of sentences.

#### The Encoder
The encoder of a seq2seq network is a RNN that outputs some value for every word from the input sentence. For every input word the encoder outputs a vector and a hidden state, and uses the hidden state for the next input word.

![encoder](https://pytorch.org/tutorials/_images/encoder-network.png)

In [21]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        # self.gru = nn.GRU(hidden_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)

    def forward(self, input, hidden, cell):
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, (hidden, cell) = self.lstm(output, (hidden, cell))
        return output, hidden, cell

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

#### The Decoder
The decoder is another RNN that takes the encoder output vector(s) and outputs a sequence of words to create the translation.

**Simple Decoder**

In the simplest seq2seq decoder, we only use the last output of the encoder. This last output is sometimes called the context vector as it encodes context for the entire sequence. This context vector is used as the initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and hidden state. The initial input token is the start-of-string <SOS> token, and the first hidden state is the context vector (the encoder’s last hidden state).
![decoder](https://pytorch.org/tutorials/_images/decoder-network.png)

In [12]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, cell):
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, (hidden, cell) = self.lstm(output, (hidden, cell))
        # output, hidden = self.gru(output, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden, cell

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


#### Preparing Training data
To train, for each pair we will need an input tensor (indexes of the words in the input sentence) and target tensor (indexes of the words in the target sentence). While creating these vectors we will append the EOS token to both sequences.

In [None]:
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair, input_lang, output_lang):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

#### Training the Model
To train we run the input sentence through the encoder, and keep track of every output and the latest hidden state. Then the decoder is given the `<SOS>` token as its first input, and the last hidden state of the encoder as its first hidden state.

“Teacher forcing” is the concept of using the real target outputs as each next input, instead of using the decoder’s guess as the next input. Using teacher forcing causes it to converge faster [but when the trained network is exploited, it may exhibit instability](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf).

You can observe outputs of teacher-forced networks that read with coherent grammar but wander far from the correct translation - intuitively it has learned to represent the output grammar and can “pick up” the meaning once the teacher tells it the first few words, but it has not properly learned how to create the sentence from the translation in the first place.

Because of the freedom PyTorch’s autograd gives us, we can randomly choose to use teacher forcing or not with a simple if statement. Turn `teacher_forcing_ratio` up to use more of it.

In [14]:
teacher_forcing_ratio = 0.5


def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()
    encoder_cell = encoder_hidden

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden, encoder_cell = encoder(
            input_tensor[ei], encoder_hidden, encoder_cell)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden
    decoder_cell = decoder_hidden

    use_teacher_forcing = True 
    # if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_cell = decoder(
                decoder_input, decoder_hidden, decoder_cell)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length


This is a helper function to print time elapsed and estimated time remaining given the current time and progress %.

In [23]:
import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))


### Plotting Helper Fn

In [24]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np
%matplotlib inline
    
def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)

The whole training process looks like this:

- Start a timer
- Initialize optimizers and criterion
- Create set of training pairs
- Start empty losses array for plotting

Then we call `train` many times and occasionally print the progress (% of examples, time so far, estimated time) and average loss.



In [20]:
def validate(encoder, decoder, target_tensor, input_tensor, criterion, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        encoder_cell = encoder_hidden
        loss = 0

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden, encoder_cell = encoder(
                input_tensor[ei], encoder_hidden, encoder_cell)
            encoder_outputs[ei] = encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden
        decoder_cell = decoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_cell = decoder(
                decoder_input, decoder_hidden, decoder_cell)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
            try:
                loss += criterion(decoder_output, target_tensor[di])
            except:
                loss += loss.item()/len(decoded_words)
            decoder_input = topi.squeeze().detach()

        return loss/len(decoded_words)


In [21]:
def trainIters(encoder, decoder, n_iters, input_lang, output_lang, val_data, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    val_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs), input_lang, output_lang)
                      for i in range(n_iters)]
    validation_pairs = [tensorsFromPair(random.choice(val_data), input_lang, output_lang)
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in trange(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]
        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            val = validate(encoder, decoder, target_tensor, input_tensor, criterion, MAX_LENGTH)
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    torch.save(encoder.state_dict(), "Checkpoints/Vanilla/encoder{}.pt".format(n_iters))
    torch.save(decoder.state_dict(), "Checkpoints/Vanilla/decoder{}.pt".format(n_iters))
    plotloss = np.asarray(plot_losses)
    np.save("Logs/vanilla{}loss.npy".format(n_iters), plotloss)
    plotval = np.asarray(val_losses)
    np.save("Logs/vanilla{}val.npy".format(n_iters), plotval)




In [24]:
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size)
encoder1.to(device)
decoder1 = DecoderRNN(hidden_size, output_lang.n_words)
decoder1.to(device)

trainIters(encoder1, decoder1, 60000, input_lang=input_lang, output_lang=output_lang, val_data = dev_pairs, print_every=1000, plot_every=100)

  2%|▏         | 1000/60000 [02:57<3:08:00,  5.23it/s]

3m 11s (- 188m 39s) (1000 1%) 7.1113


  3%|▎         | 2000/60000 [05:49<1:53:34,  8.51it/s]

6m 4s (- 176m 13s) (2000 3%) 6.2421


  5%|▍         | 2999/60000 [08:44<3:44:38,  4.23it/s]

8m 59s (- 170m 41s) (3000 5%) 5.9338


  7%|▋         | 3999/60000 [11:33<1:54:39,  8.14it/s]

11m 48s (- 165m 21s) (4000 6%) 5.6802


  8%|▊         | 4999/60000 [14:28<3:09:30,  4.84it/s]

14m 42s (- 161m 49s) (5000 8%) 5.5680


 10%|█         | 6000/60000 [17:19<3:11:59,  4.69it/s]

17m 33s (- 158m 1s) (6000 10%) 5.4717


 12%|█▏        | 7001/60000 [20:06<2:11:19,  6.73it/s]

20m 20s (- 154m 2s) (7000 11%) 5.3060


 13%|█▎        | 8000/60000 [22:59<3:40:30,  3.93it/s]

23m 14s (- 151m 4s) (8000 13%) 5.2457


 15%|█▍        | 8999/60000 [25:49<3:31:30,  4.02it/s]

26m 4s (- 147m 42s) (9000 15%) 5.2437


 17%|█▋        | 10001/60000 [28:43<1:33:34,  8.91it/s]

28m 57s (- 144m 48s) (10000 16%) 5.1408


 18%|█▊        | 11000/60000 [31:33<2:25:24,  5.62it/s]

31m 47s (- 141m 39s) (11000 18%) 5.1474


 20%|█▉        | 11998/60000 [34:17<2:17:54,  5.80it/s]

34m 32s (- 138m 10s) (12000 20%) 5.1079


 22%|██▏       | 13000/60000 [37:03<2:43:27,  4.79it/s]

37m 18s (- 134m 51s) (13000 21%) 5.0514


 23%|██▎       | 14000/60000 [39:42<2:26:58,  5.22it/s]

39m 56s (- 131m 15s) (14000 23%) 4.9807


 25%|██▌       | 15000/60000 [42:29<1:58:58,  6.30it/s]

42m 43s (- 128m 11s) (15000 25%) 4.9369


 27%|██▋       | 16000/60000 [45:14<2:02:55,  5.97it/s]

45m 29s (- 125m 5s) (16000 26%) 4.9243


 28%|██▊       | 17000/60000 [48:01<1:47:45,  6.65it/s]

48m 16s (- 122m 5s) (17000 28%) 4.9880


 30%|███       | 18000/60000 [50:43<1:55:56,  6.04it/s]

50m 58s (- 118m 55s) (18000 30%) 4.8410


 32%|███▏      | 18999/60000 [53:27<1:55:10,  5.93it/s]

53m 42s (- 115m 53s) (19000 31%) 4.8911


 33%|███▎      | 20000/60000 [56:15<1:24:14,  7.91it/s]

56m 30s (- 113m 0s) (20000 33%) 4.8523


 35%|███▌      | 21000/60000 [58:59<1:28:50,  7.32it/s]

59m 14s (- 110m 0s) (21000 35%) 4.8347


 37%|███▋      | 22000/60000 [1:01:43<2:31:29,  4.18it/s]

61m 57s (- 107m 1s) (22000 36%) 4.7909


 38%|███▊      | 23002/60000 [1:04:24<1:23:49,  7.36it/s]

64m 39s (- 104m 0s) (23000 38%) 4.7575


 40%|████      | 24000/60000 [1:07:07<1:37:28,  6.15it/s]

67m 21s (- 101m 2s) (24000 40%) 4.7863


 42%|████▏     | 25000/60000 [1:09:48<2:14:03,  4.35it/s]

70m 2s (- 98m 4s) (25000 41%) 4.7423


 43%|████▎     | 26000/60000 [1:12:31<2:13:44,  4.24it/s]

72m 45s (- 95m 9s) (26000 43%) 4.7826


 45%|████▌     | 27001/60000 [1:15:11<1:40:53,  5.45it/s]

75m 26s (- 92m 12s) (27000 45%) 4.6486


 47%|████▋     | 28000/60000 [1:17:57<2:01:48,  4.38it/s]

78m 11s (- 89m 22s) (28000 46%) 4.7038


 48%|████▊     | 29000/60000 [1:20:38<1:49:37,  4.71it/s]

80m 53s (- 86m 28s) (29000 48%) 4.5733


 50%|█████     | 30000/60000 [1:23:26<1:49:46,  4.55it/s]

83m 41s (- 83m 41s) (30000 50%) 4.6896


 52%|█████▏    | 31000/60000 [1:26:13<1:41:25,  4.77it/s]

86m 28s (- 80m 53s) (31000 51%) 4.6497


 53%|█████▎    | 32001/60000 [1:28:55<1:05:28,  7.13it/s]

89m 10s (- 78m 1s) (32000 53%) 4.5844


 55%|█████▌    | 33001/60000 [1:31:40<1:04:38,  6.96it/s]

91m 55s (- 75m 12s) (33000 55%) 4.6772


 57%|█████▋    | 34000/60000 [1:34:23<1:14:06,  5.85it/s]

94m 37s (- 72m 21s) (34000 56%) 4.6765


 58%|█████▊    | 35001/60000 [1:37:08<57:29,  7.25it/s]  

97m 23s (- 69m 33s) (35000 58%) 4.6625


 60%|██████    | 36000/60000 [1:39:52<1:10:11,  5.70it/s]

100m 6s (- 66m 44s) (36000 60%) 4.6068


 62%|██████▏   | 37000/60000 [1:42:42<53:13,  7.20it/s]  

102m 56s (- 63m 59s) (37000 61%) 4.5761


 63%|██████▎   | 38001/60000 [1:45:25<1:01:22,  5.97it/s]

105m 40s (- 61m 10s) (38000 63%) 4.5844


 65%|██████▌   | 39000/60000 [1:48:08<57:53,  6.05it/s]  

108m 22s (- 58m 21s) (39000 65%) 4.5809


 67%|██████▋   | 40000/60000 [1:50:58<50:38,  6.58it/s]  

111m 13s (- 55m 36s) (40000 66%) 4.5315


 68%|██████▊   | 41001/60000 [1:53:44<39:42,  7.97it/s]  

113m 58s (- 52m 49s) (41000 68%) 4.5606


 70%|██████▉   | 41999/60000 [1:56:30<50:01,  6.00it/s]  

116m 44s (- 50m 2s) (42000 70%) 4.5499


 72%|███████▏  | 43000/60000 [1:59:10<35:16,  8.03it/s]  

119m 25s (- 47m 12s) (43000 71%) 4.6280


 73%|███████▎  | 44001/60000 [2:01:54<38:11,  6.98it/s]  

122m 8s (- 44m 24s) (44000 73%) 4.5754


 75%|███████▌  | 45000/60000 [2:04:42<32:41,  7.65it/s]  

124m 57s (- 41m 39s) (45000 75%) 4.5101


 77%|███████▋  | 46000/60000 [2:07:24<40:28,  5.77it/s]  

127m 39s (- 38m 51s) (46000 76%) 4.4837


 78%|███████▊  | 47000/60000 [2:10:11<42:14,  5.13it/s]  

130m 26s (- 36m 4s) (47000 78%) 4.4957


 80%|████████  | 48000/60000 [2:12:57<37:55,  5.27it/s]  

133m 12s (- 33m 18s) (48000 80%) 4.4488


 82%|████████▏ | 49000/60000 [2:15:42<29:59,  6.11it/s]

135m 57s (- 30m 31s) (49000 81%) 4.3552


 83%|████████▎ | 49999/60000 [2:18:29<28:22,  5.88it/s]

138m 44s (- 27m 44s) (50000 83%) 4.4490


 85%|████████▌ | 51000/60000 [2:21:12<25:19,  5.92it/s]

141m 27s (- 24m 57s) (51000 85%) 4.4810


 87%|████████▋ | 52000/60000 [2:23:59<31:48,  4.19it/s]

144m 14s (- 22m 11s) (52000 86%) 4.4051


 88%|████████▊ | 53000/60000 [2:26:47<22:19,  5.22it/s]

147m 1s (- 19m 25s) (53000 88%) 4.4473


 90%|█████████ | 54000/60000 [2:29:31<22:00,  4.54it/s]

149m 46s (- 16m 38s) (54000 90%) 4.4105


 92%|█████████▏| 55000/60000 [2:32:14<15:02,  5.54it/s]

152m 28s (- 13m 51s) (55000 91%) 4.5049


 93%|█████████▎| 56000/60000 [2:34:58<12:29,  5.33it/s]

155m 13s (- 11m 5s) (56000 93%) 4.3908


 95%|█████████▌| 57000/60000 [2:37:46<05:47,  8.63it/s]

158m 0s (- 8m 18s) (57000 95%) 4.3742


 97%|█████████▋| 58000/60000 [2:40:29<08:27,  3.94it/s]

160m 43s (- 5m 32s) (58000 96%) 4.4047


 98%|█████████▊| 59000/60000 [2:43:07<02:10,  7.69it/s]

163m 22s (- 2m 46s) (59000 98%) 4.3501


100%|██████████| 60000/60000 [2:45:54<00:00,  6.03it/s]

166m 9s (- 0m 0s) (60000 100%) 4.4357





#### Plotting results
Plotting is done with matplotlib, using the array of loss values `plot_losses` saved while training.

#### Evaluation
Evaluation is mostly the same as training, but there are no targets so we simply feed the decoder’s predictions back to itself for each step. Every time it predicts a word we add it to the output string, and if it predicts the EOS token we stop there. We also store the decoder’s attention outputs for display later.

In [27]:
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        encoder_cell = encoder_hidden

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden, encoder_cell = encoder(
                input_tensor[ei], encoder_hidden, encoder_cell)
            encoder_outputs[ei] = encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden
        decoder_cell = decoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_cell = decoder(
                decoder_input, decoder_hidden, decoder_cell)
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words

We can evaluate random sentences from the training set and print out the input, target, and output to make some subjective quality judgements:

In [25]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words= evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

#### Training and Evaluating
With all these helper functions in place (it looks like extra work, but it makes it easier to run multiple experiments) we can actually initialize a network and start training.

Remember that the input sentences were heavily filtered. For this small dataset we can use relatively small networks of 256 hidden nodes and a single GRU layer. After about 40 minutes on a MacBook CPU we’ll get some reasonable results.

In [28]:
evaluateRandomly(encoder1, decoder1)

> tomatoleek soup ripe red tomatoes fresh parsley large leek garlic cloves minced olive oil oz tomato paste dry red wine dill hungarian paprika marjoram thyme salt pepper taste
= cut tomatoes quarters parsley process well pureed dice rest tomatoes set aside slice white part leek slices chop tender green leaves discard tougher part leaves wash carefully put leeks large pot along reserved green leaves garlic olive oil cover cups stock bring boil lower heat simmer minutes add pureed diced tomatoes add rest ingredients simmer low heat minutes chill serve nava atlas vegetariana
< heat oil heavy large skillet add onion garlic saute tender add tomatoes cook stirring frequently minutes add tomatoes tomato paste cook stirring occasionally add tomatoes tomato paste cook stirring occasionally add tomatoes cook another minutes add tomatoes cook another minutes add salt pepper taste simmer uncovered minutes add salt pepper taste simmer uncovered minutes add salt pepper taste simmer uncovered minute

## MODEL TWO - ATTENTION SEQ2SEQ

The only modification we make to the second baseline from model 1 to model 2 is to replace the default decoder with a more computationally nuanced Attention Decoder

Attention allows the decoder network to “focus” on a different part of the encoder’s outputs for every step of the decoder’s own outputs. First we calculate a set of attention weights. These will be multiplied by the encoder output vectors to create a weighted combination. The result (called attn_applied in the code) should contain information about that specific part of the input sequence, and thus help the decoder choose the right output words.

Calculating the attention weights is done with another feed-forward layer `attn`, using the decoder’s input and hidden state as inputs. Because there are sentences of all sizes in the training data, to actually create and train this layer we have to choose a maximum sentence length (input length, for encoder outputs) that it can apply to. Sentences of the maximum length will use all the attention weights, while shorter sentences will only use the first few.

![encoder-attn](https://pytorch.org/tutorials/_images/attention-decoder-network.png)

In [28]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)

    def forward(self, input, hidden, cell, encoder_outputs): #param encoder_outputs: necessary for computing attention from decoder to all encoder values
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, (hidden, cell) = self.lstm(output, (hidden, cell))
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, cell, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


#### Attention Training Method

This function is particularly important since it is called back to on several occasions, both by the attention-decoder and by the later paired auto-encoder experiment.

In [29]:
teacher_forcing_ratio = 1


def train_attn(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    encoder_hidden = encoder.initHidden()
    encoder_cell = encoder_hidden

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        encoder_output, encoder_hidden, encoder_cell = encoder(
            input_tensor[ei], encoder_hidden, encoder_cell)
        encoder_outputs[ei] = encoder_output[0, 0]

    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden
    decoder_cell = decoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_cell, decoder_attention = decoder(
                decoder_input, decoder_hidden, decoder_cell, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_cell, decoder_attention = decoder(
                decoder_input, decoder_hidden, decoder_cell, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length


In [30]:
def validation_attention(encoder, decoder, input_tensor, target_tensor, criterion, max_length = MAX_LENGTH):
    with torch.no_grad():
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        encoder_cell = encoder_hidden
        loss = 0 

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden, encoder_cell = encoder(input_tensor[ei],
                                                     encoder_hidden, encoder_cell)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden
        decoder_cell = decoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_cell, decoder_attention = decoder(
                decoder_input, decoder_hidden, decoder_cell, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
            try:
                loss += criterion(decoder_output, target_tensor[di])
            except:
                loss += loss.item()/len(decoded_words)
            decoder_input = topi.squeeze().detach()

        return loss.item()/len(decoded_words)

In [31]:
def test_attention(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        encoder_cell = encoder_hidden
        loss = 0 

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden, encoder_cell = encoder(input_tensor[ei],
                                                     encoder_hidden, encoder_cell)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden
        decoder_cell = decoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_cell, decoder_attention = decoder(
                decoder_input, decoder_hidden, decoder_cell, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]

In [32]:
def evaluateRandomly_attn(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attention= test_attention(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

In [86]:
evaluateRandomly_attn(encoder=encoder, decoder=attn_decoder,n=1)

> chinese sauce looo basic sauce dark soy sauce ginger root thin soy sauce five whole flowerets star sugar anise water peanut oil dried chili peppers sl cloves garlic split
= add ingredients saucepan boil minutes refrigerate sauce simmer foods sauce foods absorb salty taste may need add soy sauce compensate basic sauce used slowsimmer number foods
< combine ingredients except oil wok large skillet wok heat oil hot add oil hot add oil oil hot add oil wok oil hot add oil hot oil hot add garlic cook minutes side turn heat add oil wok add garlic cook seconds add onion cook seconds add water soy sauce stir well serve <EOS>



In [87]:
def trainIters_attn(encoder, decoder, n_iters, input_lang, output_lang, print_every=1000, plot_every=100, learning_rate=0.01):
    plot_losses = []
    validation_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs) , input_lang, output_lang)
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in trange(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train_attn(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print("Loss: {}".format(print_loss_avg))

        if iter % plot_every == 0:
            valloss = validation_attention(encoder, decoder, input_tensor, target_tensor, criterion, MAX_LENGTH)
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            validation_losses.append(valloss)
            plot_loss_total = 0
    torch.save(encoder.state_dict(), "Checkpoints/Attention/attnencoder{}.pt".format(n_iters))
    torch.save(decoder.state_dict(), "Checkpoints/Attention/attndecoder{}.pt".format(n_iters))
    plotloss = np.asarray(plot_losses)
    validloss = np.asarray(validation_losses)
    np.save("Logs/attention{}loss.npy".format(n_iters), plotloss)
    np.save("Logs/attention{}val.npy".format(n_iters), validloss)


In [92]:
hidden_size = 256
encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters_attn(encoder, attn_decoder, 3, input_lang, output_lang, print_every=1, plot_every=1)

  0%|          | 0/3 [00:00<?, ?it/s]

Loss: 10.439402988978795


 33%|███▎      | 1/3 [00:00<00:01,  1.37it/s]

Loss: 10.445953369140625


 67%|██████▋   | 2/3 [00:01<00:00,  1.78it/s]

Loss: 10.431387271521226


100%|██████████| 3/3 [00:01<00:00,  1.64it/s]


In [39]:
evaluateRandomly_attn(encoder, attn_decoder)

> havregrynskage oat cakes butter unsalted oatmeal instant sugar granulated corn syrup white
= melt butter skillet stir sugar add oatmeal cook minutes stirring occasionally oatmeal golden brown remove heat stir corn syrup rinse custard cups muffin tins cold water shake excess moisture pack bottoms sides oatmeal mixture dividing equally refrigerate least hours loosen cakes running knife around edges gently slide serve cold buttermilk soup
< mix ingredients together store airtight container <EOS>

> herman cinnamon rolls herman salt baking powder flour soda oil margerine stick butter melted brown sugar nuts
= mix herman salt baking powder flour soda oil form dough knead lightly roll inch thickness floured surface spread soft margerine sprinkle cinnamon sugar roll jelly roll cut inch slices spread topping bottom x place roll slices top flat side bake oven minutes remove immediately overturning cookie sheet combine stick melted margerine brown sugar nuts
< preheat oven f grease muffin tins

#### Visualizing Attention
A useful property of the attention mechanism is its highly interpretable outputs. Because it is used to weight specific encoder outputs of the input sequence, we can imagine looking where the network is focused most at each time step.

You could simply run `plt.matshow(attentions)` to see attention output displayed as a matrix, with the columns being input steps and rows being output steps:

In [None]:
%matplotlib inline
output_words, attentions = test_attention(
    encoder, attn_decoder, "je suis trop froid .")
plt.matshow(attentions.numpy())

: 

For a better viewing experience we will do the extra work of adding axes and labels:

In [None]:
def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = test_attention(
        encoder, attn_decoder, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")

evaluateAndShowAttention("elle est trop petit .")

evaluateAndShowAttention("je ne crains pas de mourir .")

evaluateAndShowAttention("c est un jeune directeur plein de talent .")

: 

## MODEL 3 - AUTOENCODERS

The approach here is to 'warm up' the encoders and decoders each by training autoencoders (ie. encoder-decoders training their weights on the same data) for both the ingredient list and the steps. We train each for [EPOCH], before continuing the training by loading the state dictionaries, pairing the ingredient encoder and recipe attention decoder, and continuing to fine tune the weights. 

In [25]:
#First create new data represent duplicated pairs
def re_pair(paired_data):
    sources = []
    targets = []
    for i in paired_data:
        sources.append([i[0], i[0]])
        targets.append([i[1], i[1]])
    return sources, targets

training_pairs = extract_pairs("Clean_Data/train.tsv")
train_src, train_trg = re_pair(training_pairs)

#We can re-use the single-epoch training logic from MODEL TWO but will need to modify the training loop slightly to accept new data and of course save the states. We also do not need to plot losses.

def AE_training_loop(pairs, name, encoder, decoder, n_iters, input_lang, output_lang, print_every=1000, plot_every=100, learning_rate=0.01):
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs), input_lang, output_lang)
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()
 
    for iter in trange(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train_attn(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print("Average Loss: {}".format(print_loss_avg))
    torch.save(encoder.state_dict(), "Checkpoints/AE/{}_encoder_{}.pt".format(name, n_iters))
    torch.save(decoder.state_dict(), "Checkpoints/AE/{}_decoder_{}.pt".format(name, n_iters))
    torch.save(decoder_optimizer.state_dict(), "Checkpoints/AE/{}_dec_optim_{}.pt".format(name, n_iters)) #Also necessary to save optimiser state dict to resume training later
    torch.save(encoder_optimizer.state_dict(), "Checkpoints/AE/{}_enc_optim_{}.pt".format(name, n_iters))


#Then, train target autoencoder - this can re-use the training function




#### Autoencoder Training

We train both encoders and decoders on the duplicates dataset

In [40]:
hidden_size = 256

ae_ing_encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
ae_ing_decoder = AttnDecoderRNN(hidden_size, input_lang.n_words, dropout_p=0.1).to(device)
AE_training_loop(train_src, "ingredients", ae_ing_encoder, ae_ing_decoder, input_lang=input_lang, output_lang=input_lang, n_iters = 20000)

ae_rec_encoder = EncoderRNN(output_lang.n_words, hidden_size).to(device)
ae_rec_decoder = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)
AE_training_loop(train_trg, "recipes", ae_rec_encoder, ae_rec_decoder, input_lang=output_lang, output_lang=output_lang, n_iters=20000)

  5%|▌         | 1000/20000 [03:01<1:05:53,  4.81it/s]

Average Loss: 7.288514762552162


 10%|▉         | 1999/20000 [06:01<46:13,  6.49it/s]  

Average Loss: 6.331298516426755


 15%|█▌        | 3000/20000 [08:57<54:10,  5.23it/s]  

Average Loss: 6.0358120431411955


 20%|██        | 4001/20000 [11:56<50:38,  5.27it/s]  

Average Loss: 5.81758999001565


 25%|██▌       | 5001/20000 [14:50<35:20,  7.07it/s]  

Average Loss: 5.680114111571367


 30%|███       | 6000/20000 [17:48<34:07,  6.84it/s]  

Average Loss: 5.569747595997876


 35%|███▌      | 7001/20000 [20:50<36:25,  5.95it/s]  

Average Loss: 5.543540254899497


 40%|████      | 8000/20000 [23:47<45:22,  4.41it/s]  

Average Loss: 5.416150746459968


 45%|████▌     | 9001/20000 [26:38<25:47,  7.11it/s]  

Average Loss: 5.309886947387344


 50%|█████     | 10001/20000 [29:34<25:07,  6.63it/s] 

Average Loss: 5.220870566131888


 55%|█████▌    | 11001/20000 [32:40<20:47,  7.22it/s]

Average Loss: 5.157565733951362


 60%|█████▉    | 11999/20000 [35:39<23:11,  5.75it/s]

Average Loss: 5.1340026351606145


 65%|██████▍   | 12999/20000 [38:35<23:13,  5.03it/s]

Average Loss: 5.008685103019934


 70%|███████   | 14000/20000 [41:28<13:54,  7.19it/s]

Average Loss: 5.002209836913369


 75%|███████▌  | 15000/20000 [44:21<12:36,  6.61it/s]

Average Loss: 4.921798479977218


 80%|███████▉  | 15999/20000 [47:12<09:09,  7.27it/s]

Average Loss: 4.874630419052859


 85%|████████▌ | 17001/20000 [50:13<10:22,  4.82it/s]

Average Loss: 4.883381383576844


 90%|█████████ | 18001/20000 [53:08<04:36,  7.23it/s]

Average Loss: 4.813930044489898


 95%|█████████▌| 19001/20000 [56:04<02:23,  6.95it/s]

Average Loss: 4.723309008001425


100%|██████████| 20000/20000 [59:02<00:00,  5.65it/s]

Average Loss: 4.730463602211124





TypeError: format() argument 2 must be str, not int

Then, staple the warmed up EncoderRNN and DecoderRNN together in new training iteration, and save them as completed Encoder/Decoder pair.

In [None]:
#Instantiate models
ingredient_encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
recipe_decoder = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

#Load model state dictionaries
ingredient_encoder.load_state_dict(torch.load("Checkpoints/AE/ingredients_encoder_20000.pt"))
recipe_decoder.load_state_dict(torch.load("Checkpoints/AE/recipe_decoder_20000.pt"))

#Instantiate optimisers
encoder_optimizer = optim.SGD(ingredient_encoder.parameters(), lr=0.01)
decoder_optimizer = optim.SGD(recipe_decoder.parameters(), lr=0.01)

#Load optimiser state dictionaries
encoder_optimizer.load_state_dict(torch.load("Checkpoints/AE/ingredients_enc_optim_20000.pt"))
decoder_optimizer.load_state_dict(torch.load("Checkpoints/AE/recipe_dec_optim_20000.pt"))

def AE_fine_tune(encoder, decoder, encoder_optim, decoder_optim, input_lang, output_lang, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    training_pairs = [tensorsFromPair(random.choice(pairs), input_lang, output_lang)
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    for iter in trange(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train_attn(input_tensor, target_tensor, encoder,
                     decoder, encoder_optim, decoder_optim, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    torch.save(encoder.state_dict(), "Checkpoints/AE/encoder_fine_tune_{}.pt".format(n_iters))
    torch.save(decoder.state_dict(), "Checkpoints/AE/decoder_fine_tune_{}.pt".format(n_iters))
    plotloss = np.asarray(plot_losses)
    np.save("Logs/autoencoder{}loss.npy".format(n_iters), plotloss)

## MODEL FOUR - ALCHEMY

This model is so named due to the fact that it, like alchemy, is a step or two removed from cooking, but fun to experiment with all the same.

Here, I build on the work of [PAPER] to create a lightweight heuristic filter for the text generation. Here, I persist the best Seq2Seq model from the previous three (based on its loss metric) as the underlying model. Then, when sampling from the model, I apply a heuristic beam search. This creates a pair of constraints to be satisfied by each line of generated text. Whenever an *ingredient* would be generated, it is checked against the list of ingredients in the ingredients input list, with a small loss reward A more nuanced version of this method might follow Kiddon et. al.'s Neural Checklist Models. Because the Attention weights on the checklists are learned, the importance of the checklist weights in generation can also be learned (here, PyTorch's autograd feature is an elegant way of implementing this). I instead use the heuristic to preferentially persist any beams whose generation meets the constraint over beams which do not. For example:

Ingredients:
- chicken, paprika, onion, garlic

Current generation:
- Add the [ ]

Beams (k = 2):
- onion (loss = 5.6)
- carrot (loss = 5.1)
- beans (loss = 4.6)
- capers (loss = 7.7)

Alchemy will select onion, since it meets the constraint despite having a higher loss, and beans, to persist into the next two beams (with the sentence candidates thus being "Add the onions" and "Add the beans"). This is very lightweight since the heuristic can be applied purely at time of inference, and does not add to the model's training time whatsoever.

In [34]:
#Order of operations for alchemy, which receives two dictionaries representing the results of each beam
#for each dictionary, need to process the OUTPUT to create two new candidate dictionaries, 
def alchemy(k:int, candidates, condition, ingredient_lang:Lang, output_lang):
    unpacked_candidates = []
    for c in candidates:
        vec = c["out"] #Access output vector
        topv, topi = vec.data.topk(k) #Extract top k losses and indices
        for i in range(k):
            decoder_input = topi[k].squeeze().detach() #New decoder input on candidate vector will
            loss = topv[k].item()
            word = topi[k].item()
            new_sentence = c["sentence"]
            new_sentence.append(output_lang.index2word(word))
            new_loss = c["avg_loss"] + loss
            new_candidate = {"in": decoder_input, "hidden":c["hidden"], "cell":c["cell"], "sentence":new_sentence, "avg_loss": new_loss}
            unpacked_candidates.append(new_candidate)
    unpacked_candidates = sorted(unpacked_candidates, key=lambda d:d["avg_loss"])
    meets_cond = []
    cond_na = []
    fails_cond = []
    for c in unpacked_candidates:
        key_word = c["sentence"][-1]
        if ingredient_lang.contains(key_word):
            if key_word in condition:
                meets_cond.append(c)
                continue
            fails_cond.append(c)
            continue
        cond_na.append(c)
    outputs = meets_cond + cond_na + fails_cond
    stop_clause = False
    if "<EOS>" in outputs[0]["sentence"]:
        if "<EOS>" in outputs[1]["sentence"]:
            stop_clause = True
    return [outputs[0], outputs[1]], stop_clause

In [35]:
def alchemy_inference(encoder, decoder, sentence, input_lang, output_lang, k = 2, max_length=MAX_LENGTH):
    with torch.no_grad():
        condition = sentence.split()
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()
        encoder_cell = encoder_hidden
        loss = 0 

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden, encoder_cell = encoder(input_tensor[ei],
                                                     encoder_hidden, encoder_cell)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden
        decoder_cell = decoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)
        beams = [{"in":decoder_input, "hidden":decoder_hidden, "cell":decoder_cell, "sentence":["<SOS>"], "avg_loss":0} ]
        #case: need to overwrite
            # set beam[1][3] = candidate[3]
        # beams = (decoder_input, decoder_hidden, decoder_cell), (decoder_input, decoder_hidden, decoder_cell)]
        # sentence_history = {1:['<SOS>'], 2:['<SOS>']}
        for di in range(max_length):
            candidates = []
            for beam in beams:
                decoder_output, decoder_hidden, decoder_cell, decoder_attention = decoder(
                    beam['in'], beam['hidden'], beam['cell'], encoder_outputs)
                decoder_attentions[di] = decoder_attention.data
                c = {"out": decoder_output, "hidden": decoder_hidden, "cell": decoder_cell, "sentence": beam["sentence"], "avg_loss": beam["avg_loss"]}
                candidates.append(c)
                #candidates list looks like:
                #[ (out, hidden, cell, sentence), (out, hidden, cell, sentence)]
                # sort on output
                # get TWO candidates, which are going to be the new beams

                # topv, topi = decoder_output.data.topk(1)
                # if topi.item() == EOS_token: # Here down to next comment is just logic for generating and appending words.
                #     decoded_words.append('<EOS>')
                #     break
                # else:
                #     decoded_words.append(output_lang.index2word[topi.item()])
            beams, stop_clause = alchemy(k, candidates, condition, input_lang, output_lang)
            if stop_clause:
                break

            # decoder_input = topi.squeeze().detach()  # New decoder input given topk

        return beams[0]["sentence"]

In [13]:
temp = [{"hi":1}, {"hi":6}, {"hi":3}, {"hi":10}]
temp = sorted(temp, key=lambda d:d["hi"])
print(temp)

[{'hi': 1}, {'hi': 3}, {'hi': 6}, {'hi': 10}]


## Quantitative Evaluation

Here we define metrics for the ingredient recall, extra ingredients added, and use NLTK to calculate the BLEU and METEOR scores.

In [43]:
encoder = EncoderRNN(input_lang.n_words, hidden_size)
decoder = DecoderRNN(hidden_size, output_lang.n_words)
encoder.load_state_dict(torch.load("Checkpoints/Vanilla/encoder35000"))
decoder.load_state_dict(torch.load("Checkpoints/Vanilla/decoder35000"))
encoder.to(device)
decoder.to(device)

testing_pairs = extract_pairs("Dataset/test.tsv")

def evaluateModel1(encoder, decoder, data, n):
    for i in range(n):
        pair = random.choice(data)
        print('>', pair[0])
        print('=', pair[1])
        output_words= evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        overlap = calculate_metrics(pair[0], prediction=output_sentence)
        print('<', output_sentence)
        print("BLEU: {}".format(overlap))
        print('')

def calculate_metrics(ground_truth, prediction):
    ref_tokens = ground_truth.split()
    ref_tokens = [ref_tokens]
    candidate_tokens = prediction.split()
    #TODO: implement METEOR, %Rec, %Spare
    bleu = sentence_bleu(ref_tokens, candidate_tokens)
    # meteor = 
    recall = ingredient_recall(ground_truth, prediction)
    extra = extra_ingredients(ground_truth, prediction, input_lang) # TODO: pass in the ingredient lang here
    # extra_ingredients = 
    return (bleu, recall, extra)

#function to find the fraction of ingredients in the ground truth list which were included in the generated recipe
def ingredient_recall(ground_truth, prediction):
    recalled = 0
    gt = ground_truth.split()
    for i in gt:
        if i in prediction:
            recalled += 1
    return recalled/len(gt)

#function to find the fraction of ingredients which were wrongly generated (ie. not in the recipe specification)
def extra_ingredients(ground_truth, prediction, ingredient_lang):
    predicted_ingredients = [word for word in prediction.split() if ingredient_lang.contains(word)]
    counting = 0
    for ing in predicted_ingredients:
        if ing not in ground_truth:
            counting += 1
    return counting/len(predicted_ingredients)
        


evaluateModel1(encoder, decoder, testing_pairs, n=10)


> butterscotch brownies c butter ts baking powder c firmly packed light brown ts salt sugar ts vanilla extract egg c chopped walnuts c sifted flour
= preheat oven to of melt butter in saucepan over low heat remove from heat stir in sugar mix until well blended cool stir in egg sift flour baking powder and salt together in bowl add to butter mixture blend well stir in vanilla and walnuts pour batter into greased and floured square pan bake for minutes cut into squares while still warm


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


< preheat oven to f grease a x baking dish combine flour sugar baking powder salt and salt in a large mixing bowl mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add eggs one at a time beating well after each addition stir in vanilla add vanilla and mix well add 

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


< in a large skillet saute the onion and garlic in the oil until the onions are translucent add the onion and saute until the onion is translucent add the onion and saute until the onions are translucent add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until the onions are tender add the onion and cook until th