# NLP From Scratch: Translation with a Seq2Seq RNN model

In this project we will be teaching a neural network to translate from German to English.


This is made possible by the simple but powerful idea of the [sequence
to sequence network](https://arxiv.org/abs/1409.3215), in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.
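
As a rough sketch of this idea in equations: the encoder reads the source tokens x_1, ..., x_T and the decoder generates the target tokens one at a time. The symbols f, g, W, h_t, s_t and c are notation chosen here for illustration and do not come from the original text; f and g stand for whatever recurrent cell is used (a GRU later in this notebook).

```latex
\begin{aligned}
h_t &= f(x_t, h_{t-1}), \quad t = 1, \dots, T && \text{(encoder recurrence)} \\
c &= h_T && \text{(the vector that condenses the input)} \\
s_0 &= c, \quad s_t = g(y_{t-1}, s_{t-1}) && \text{(decoder recurrence, with } y_0 = \texttt{SOS}\text{)} \\
p(y_t \mid y_{<t}, x) &= \operatorname{softmax}(W s_t) && \text{(distribution over the output vocabulary)}
\end{aligned}
```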


## Dependencies

In [19]:
# using Python 3.9
%pip install pandas torch matplotlib numpy ipython transformers sentencepiece



In [20]:
from __future__ import unicode_literals, print_function, division
from io import open
import re
import random
import time
import pandas as pd
from IPython.display import Image
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from utils import *
%load_ext autoreload
%autoreload 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



## Loading & preprocessing text data

Implement the function `read_txt` to read the input file and transform it into a dataframe, which is then fed into the `parse_data` function to be normalized. If the parameter `reverse` is True, the order of the language columns is swapped, which reverses the translation direction. Since there are a *lot* of example sentences and we want to train
something quickly, we'll only select a subset of pairs: implement the `filter_pairs` function to drop all pairs that contain no word ending in one of a predefined list of suffixes.
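
The suffix filter can be illustrated without pandas. The following is a minimal sketch, assuming lower-cased sentences and a shortened, hypothetical suffix list; it only shows the kind of pattern that `filter_pairs` can hand to a vectorized string match.

```python
import re

# Hypothetical suffix list and sentences, used only for illustration.
suffixes = ["hood", "ness", "ment"]

# \b\w+ requires at least one character before the suffix, so a word that merely
# starts with "ment" does not match; (?:...) groups the alternatives without capturing.
pattern = re.compile(r"\b\w+(?:" + "|".join(map(re.escape, suffixes)) + r")\b")

sentences = [
    "sisterhood is very important",
    "we use mentimeter in lectures",
    "kindness and perseverance are virtues",
]
kept = [s for s in sentences if pattern.search(s)]
print(kept)
```

Only the first and third sentences are kept: "mentimeter" contains "ment" but does not end with it.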

In [22]:
def read_txt(path:str)-> pd.DataFrame:
  """
  #TODO: Task 1 (5 points)
  Parse the data from the file and return a DataFrame with columns ['ENG','GER'].
  """
  # Let's create an empty list to store the row of the data
  rows = []

  # And now let's read the file line by line and aggregate to row list
  with open(path, 'r', encoding='utf-8') as file:
    for line in file:
      rows.append(line.strip().split('\t'))


  # Now let's create a DataFrame from the rows list
  DF = pd.DataFrame(rows, columns=['ENG', 'GER', 'Info'])
  # Copy the selection so parse_data can overwrite columns without a SettingWithCopyWarning
  pairs = DF[['ENG', 'GER']].copy()

  return pairs



def parse_data(pairs:pd.DataFrame, reverse=False)-> pd.DataFrame:
  pairs['GER'] = pairs['GER'].apply(normalize_string)
  pairs['ENG'] = pairs['ENG'].apply(normalize_string)

  if reverse:
    pairs = pairs.iloc[:, [1,0]]

  return pairs

# Let's read the data
path = 'data/pairs.txt'
DF = read_txt(path)
DF_n = parse_data(DF)
DF_n.head()

|   | ENG | GER |
|---|-----|-----|
| 0 | go . | geh . |
| 1 | hi . | hallo ! |
| 2 | hi . | gru gott ! |
| 3 | run ! | lauf ! |
| 4 | run . | lauf ! |
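
The `normalize_string` helper lives in `utils` and is not shown in this notebook. The sample rows above (lower-cased text, a space before punctuation, "gru gott" with the umlaut and the eszett dropped) are consistent with the normalization used in the PyTorch tutorial this problem set is based on. The sketch below is an assumption about what such a helper might look like, under a hypothetical name, and is not the actual `utils` code.

```python
import re
import unicodedata

def unicode_to_ascii(s):
    """Strip combining marks after NFD decomposition (for example, u-umlaut becomes u)."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

def sketch_normalize_string(s):
    """Hypothetical stand-in for utils.normalize_string."""
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)       # put a space before sentence punctuation
    s = re.sub(r"[^a-zA-Z.!?]+", " ", s)    # replace every other character run with one space
    return s.strip()

print(sketch_normalize_string("Go."))
print(sketch_normalize_string("Grüß Gott!"))
```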


In [23]:
suffixes = ["hood", "ness", "ment", "ship", "ance", "ise", "ize", "ly", "tion", "ity"]

"""
#TODO: Task 2 (10 pt)

Implement the filter_pairs function so that it takes in a pd.DataFrame of pairs of sentences in 2 languages
and only keeps the rows for which the sentence of the selected language contains at least one word ending
in one (or several) of the suffixes. E.g.:

"Sisterhood is very important" --> keep
"We use Mentimeter in lectures" --> drop
"Kindness and perseverance are virtues" --> keep

Your method should work by pattern detection rather than explicit iteration.

"""
def filter_pairs(pairs,
                 suffixes,
                 language="ENG"):

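    # Let's create a pattern to match any word ending in one of the suffixes.
    # The non-capturing group (?:...) avoids the pandas "match groups" UserWarning.
    pattern = r'\b\w+(?:' + '|'.join(suffixes) + r')\b'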

    # Let's filter the pairs
    pairs = pairs[pairs[language].str.contains(pattern, regex=True, na=False)]

    return pairs

# Let's define the suffixes and call the functions
suffixes = ["hood", "ness", "ment", "ship", "ance", "ise", "ize", "ly", "tion", "ity"]
DF_f = filter_pairs(DF_n, suffixes, language='ENG')




Each word in a language will be represented as a one-hot
vector. We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Language`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
``word2count`` which will be used to replace rare words later.
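
As a small illustration of these three dictionaries, the sketch below builds them for a few toy sentences. The function name `build_vocab` and the sentences are hypothetical; the `Language` class in the next cell does the same bookkeeping incrementally.

```python
from collections import Counter

SOS_TOKEN, EOS_TOKEN = 0, 1  # indices reserved for the special tokens

def build_vocab(sentences):
    """Return (word2index, index2word, word2count) for whitespace-tokenized sentences."""
    word2count = Counter(word for s in sentences for word in s.split(" "))
    index2word = {SOS_TOKEN: "SOS", EOS_TOKEN: "EOS"}
    word2index = {}
    for word in word2count:  # iteration follows the order of first appearance
        word2index[word] = len(index2word)
        index2word[word2index[word]] = word
    return word2index, index2word, word2count

word2index, index2word, word2count = build_vocab(["run !", "run .", "go ."])
print(word2index)
print(dict(word2count))
```

Word indices start at 2 because 0 and 1 are taken by the SOS and EOS tokens.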

In [24]:
SOS_token = 0
EOS_token = 1

"""
TODO: Task 3 (10 pt)

Implement the method stem, that takes as input a list of suffixes and maps all the words
that can be created with a common stem + one of the suffixes in the list to a common word stem.
For example, the words "precise" and "precisely" should be mapped to "precis".

The function should also take care of:
- removing the original words from all counters and class dictionaries
- updating word2count to the total counts for all the stemmed words
- updating the index in the dictionary so that its values run from 0 to n_words

You can create additional functions for the task.

"""


class Language:
    def __init__(self):
        self.word2index = {} # maps word to integer index
        self.word2count = {} # maps word to its frequency
        self.index2word = {0: "SOS", 1: "EOS"} # maps index to a word
        self.n_words = 2  # Count SOS and EOS

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            self.add_word(word)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


    def remove_word(self, word):
        if word in self.word2index:
            del self.word2count[word]
            index = self.word2index[word]
            del self.word2index[word]
            del self.index2word[index]



    def stem(self, suffix_list):

        # Map words to their stems
        stem_map = {}
        for word in list(self.word2count.keys()):
            for suffix in suffix_list:
                if word.endswith(suffix):
                    stem = word[:-len(suffix)]
                    if stem:
                        stem_map[word] = stem
                    break

        # Update word2count by summing counts of words with the same stem
        new_word2count = {}
        for word, stem in stem_map.items():
            count = self.word2count[word]
            if stem in new_word2count:
                new_word2count[stem] += count
            else:
                new_word2count[stem] = count

        # Remove the original words, then merge the stemmed counts into the vocabulary
        for word in stem_map.keys():
            self.remove_word(word)
        for stem, count in new_word2count.items():
            # A stem may already exist as a word of its own (e.g. "precise"),
            # so add to its existing count instead of overwriting it
            self.word2count[stem] = self.word2count.get(stem, 0) + count

        # Update indices to ensure they are consecutive
        self.word2index = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2
        for word in self.word2count.keys():
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1


In [25]:
# Example usage


lang = Language()
lang.add_sentence("precise precisely precision kindness")
lang.add_sentence("precisely precise")
print("Before stemming:")
print("word2index:", lang.word2index)
print("word2count:", lang.word2count)

suffixes = ["ly", "ion", "ness"]
lang.stem(suffixes)

print("\nAfter stemming:")
print("word2index:", lang.word2index)
print("word2count:", lang.word2count)

Before stemming:
word2index: {'precise': 2, 'precisely': 3, 'precision': 4, 'kindness': 5}
word2count: {'precise': 2, 'precisely': 2, 'precision': 1, 'kindness': 1}

After stemming:
word2index: {'precise': 2, 'precis': 3, 'kind': 4}
word2count: {'precise': 4, 'precis': 1, 'kind': 1}


The full process for preparing the data is:

-  Read text file and split into lines, split lines into pairs
-  Normalize text, filter pairs by suffix
-  Make word lists from sentences in pairs
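
The first and last steps can be sketched on an in-memory stand-in for the data file. The sample lines, the third "attribution" column and the helper name `read_pairs` are assumptions made for illustration; the notebook itself uses `read_txt`, `parse_data` and `prepare_data`.

```python
import io

# Hypothetical stand-in for the tab-separated file: English, German, extra info.
raw = io.StringIO("Go.\tGeh.\tattribution 1\nHi.\tHallo!\tattribution 2\n")

def read_pairs(handle, reverse=False):
    """Split each line on tabs and keep the first two fields as a sentence pair."""
    pairs = []
    for line in handle:
        eng, ger = line.rstrip("\n").split("\t")[:2]
        pairs.append((ger, eng) if reverse else (eng, ger))
    return pairs

pairs_sketch = read_pairs(raw, reverse=True)

# Word lists for the input (first) and output (second) side of each pair.
input_words = sorted({w for src, _ in pairs_sketch for w in src.split(" ")})
output_words = sorted({w for _, tgt in pairs_sketch for w in tgt.split(" ")})
print(pairs_sketch, input_words, output_words)
```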




In [26]:
def prepare_data(pairs, suffixes, stem=None):

    print(f"Read {len(pairs)} sentence pairs")
    pairs = filter_pairs(pairs, suffixes=suffixes)
    print(f"Filtered {len(pairs)} sentence pairs")
    pairs = pairs.to_numpy()

    input_lang = Language()
    output_lang = Language()

    for pair in pairs:
        input_lang.add_sentence(pair[0])
        output_lang.add_sentence(pair[1])

    if stem:
        (input_lang.stem(suffixes) if stem=="input"
         else output_lang.stem(suffixes))

    print(f"Input language: {input_lang.n_words} words")
    print(f"Output language: {output_lang.n_words} words")
    return input_lang, output_lang, pairs

path = "data/pairs.txt"
suffixes = ["hood", "ness", "ment", "ship", "ance", "ise", "ize", "ly", "etion", "ity"]
pairs = parse_data(read_txt(path), reverse=True)
input_lang, output_lang, pairs = prepare_data(pairs, suffixes, stem="output")
### SHOW NOTEBOOK OUTPUT ###



Read 255817 sentence pairs




Filtered 21510 sentence pairs
Input language: 11073 words
Output language: 6309 words



## Preparing Training Data

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and target tensor (indexes of the words in
the target sentence). While creating these vectors we will add the SOS token at the beginning and the
EOS token at the end of both sequences.
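
A minimal sketch of this conversion on plain Python lists, assuming a toy vocabulary with an `<unk>` entry for out-of-vocabulary words (the vocabulary and the function name are hypothetical):

```python
SOS_TOKEN, EOS_TOKEN = 0, 1

def sentence_to_indices(word2index, sentence, unk="<unk>"):
    """Map words to indices, fall back to the <unk> index, and wrap with SOS/EOS."""
    unk_index = word2index[unk]
    body = [word2index.get(word, unk_index) for word in sentence.split(" ")]
    return [SOS_TOKEN] + body + [EOS_TOKEN]

toy_vocab = {"tom": 2, "was": 3, "upset": 4, ".": 5, "<unk>": 6}
indices = sentence_to_indices(toy_vocab, "tom was visibly upset .")
print(indices)  # [0, 2, 3, 6, 4, 5, 1]
```

The word "visibly" is not in the toy vocabulary, so it is mapped to the `<unk>` index 6.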




In [27]:
input_lang.add_word("<unk>")
output_lang.add_word("<unk>")

In [28]:
def indexesFromSentence(lang, sentence):
    #return [lang.word2index[word] for word in sentence.split(' ')] removed as per TA's instructions
    return [lang.word2index.get(word, lang.word2index["<unk>"]) for word in sentence.split(' ')]

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.insert(0, SOS_token)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).unsqueeze(0)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

## Seq2Seq RNN Model

To train we run the input sentence through the encoder and keep its
final hidden state. The decoder is then given the `<SOS>` token as its
first input, and the last hidden state of the encoder as its first
hidden state. At each subsequent step, the decoder takes as input the
most likely token it predicted in the previous step. If
"teacher forcing" is used, the real target token is fed as
the next input instead of the decoder's guess.
Teacher forcing makes training converge faster, but [when the trained
network is exploited, it may exhibit
instability](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf).
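
The choice between the target token and the decoder's own guess can be sketched without any neural network. In the sketch below, `predict` is a hypothetical stand-in for one decoder step and the token ids are made up; only the per-step coin flip mirrors the idea described above.

```python
import random

def decode_inputs(targets, predict, teacher_forcing_ratio, rng):
    """Return the token that is fed to the decoder at each step.

    targets: target token ids, starting with SOS and ending with EOS.
    predict: stand-in for one decoder step (input token -> predicted token).
    """
    inputs = [targets[0]]  # the first input is always SOS
    for t in range(1, len(targets) - 1):
        prediction = predict(inputs[-1])
        use_target = rng.random() < teacher_forcing_ratio
        inputs.append(targets[t] if use_target else prediction)
    return inputs

targets = [0, 5, 6, 7, 1]
always_nine = lambda token: 9  # toy decoder that always predicts token 9

print(decode_inputs(targets, always_nine, 1.0, random.Random(0)))  # [0, 5, 6, 7]
print(decode_inputs(targets, always_nine, 0.0, random.Random(0)))  # [0, 9, 9, 9]
```

With a ratio of 1.0 the decoder always sees the ground truth, with 0.0 it always sees its own predictions, and values in between mix the two at random.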






In [29]:
""" Task 4 (10 pt)
Implement the decoder's iterative step to support probabilistic use of teacher forcing according to `teacher_forcing_ratio`.
At each step, the model should choose according to the ratio whether to use its last prediction or the target token as the input for the next prediction.

"""
class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim, n_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers, batch_first=True)

    def forward(self, src):
        # src: (batch_size, src_len)
        embedded = self.embedding(src)  # (batch_size, src_len, embed_dim)
        outputs, hidden = self.gru(embedded)  # outputs: (batch_size, src_len, hidden_dim), hidden: (n_layers, batch_size, hidden_dim)
        return hidden  # Only hidden state is returned for decoder

# Decoder Model
class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim, n_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)

    def forward(self, trg, hidden):
        trg = trg.unsqueeze(1)
        embedded = self.embedding(trg)
        output, hidden = self.gru(embedded, hidden)
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden

# Seq2Seq Model
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src: (1, src_len), trg: (1, trg_len)
        trg_len = trg.size(1)
        output_dim = self.decoder.fc_out.out_features

        outputs = torch.zeros(1, trg_len, output_dim).to(device)

        # Encode the source sequence
        hidden = self.encoder(src)

        # First input to the decoder is the <sos> token
        input = trg[:, 0]

        for t in range(1, trg_len):
            # Decode one token at a time
            output, hidden = self.decoder(input, hidden)
            outputs[:, t, :] = output

            # Decide whether to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)  # Get the predicted token
            input = trg[:, t] if teacher_force else top1  # Use teacher forcing or predicted token


        return outputs

In [31]:
cfg = {"n_iters": 10**3,
       "print_every":10,
       "plot_every":100,
       "learning_rate":0.01,
       "teacher_forcing_ratio":.5}

teacher_forcing_ratio = .5

def train(training_pair,
          model,
          optimizer,
          criterion,
          teacher_forcing_ratio=0.5):

    optimizer.zero_grad()
    input_tensor, target_tensor = training_pair
    pred_tensor = model(input_tensor, target_tensor, teacher_forcing_ratio)
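    # Position 0 of the output is never predicted (it corresponds to <SOS>), so exclude it from the loss
    loss = criterion(pred_tensor.squeeze(0)[1:], target_tensor.squeeze(0)[1:])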

    loss.backward()
    optimizer.step()
    return loss.item()


def trainIters(model, cfg):

    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    optimizer = optim.SGD(model.parameters(),
                          lr=cfg["learning_rate"])

    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(cfg["n_iters"])]

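    # The decoder returns raw logits, so use CrossEntropyLoss (log-softmax + NLL).
    # NLLLoss expects log-probabilities; on raw logits it is unbounded below and goes negative.
    criterion = nn.CrossEntropyLoss()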

    for iter in range(1, cfg["n_iters"] + 1):
        training_pair = training_pairs[iter - 1]
        loss = train(training_pair,
                     model,
                     optimizer,
                     criterion,
                     cfg["teacher_forcing_ratio"]
                     )
        print_loss_total += loss
        plot_loss_total += loss

        if iter % cfg["print_every"] == 0:
            print_loss_avg = print_loss_total / cfg["print_every"]
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / cfg["n_iters"]),
                                         iter, iter / cfg["n_iters"] * 100, print_loss_avg))

        if iter % cfg["plot_every"] == 0:
            plot_loss_avg = plot_loss_total / cfg["plot_every"]
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)

    return plot_losses



## Evaluation

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder's predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there.
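
This feedback loop can be sketched with a toy step function in place of the trained decoder. The names `greedy_decode` and `toy_step`, the `max_len` cap and the token ids are assumptions made for illustration.

```python
SOS_TOKEN, EOS_TOKEN = 0, 1

def greedy_decode(step, max_len=10):
    """Feed each prediction back as the next input until EOS or max_len tokens."""
    token, state, output = SOS_TOKEN, None, []
    for _ in range(max_len):
        token, state = step(token, state)
        if token == EOS_TOKEN:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Toy decoder: emits 5, 6, 7 and then EOS, using the state as a position counter."""
    script = [5, 6, 7, EOS_TOKEN]
    position = 0 if state is None else state
    return script[position], position + 1

print(greedy_decode(toy_step))             # [5, 6, 7]
print(greedy_decode(toy_step, max_len=2))  # [5, 6]
```

The `max_len` cap guards against a decoder that never emits EOS.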




In [32]:
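def evaluate(model, sentence):
    input_tensor = tensorFromSentence(input_lang, sentence)
    with torch.no_grad():
        # With teacher_forcing_ratio=0, only the first token (SOS) and the length of the second argument are used
        pred_tensor = model(input_tensor, input_tensor, teacher_forcing_ratio=0)
        # Position 0 is never predicted (it stays all zeros), so skip it
        pred_indices = pred_tensor.squeeze(0)[1:].argmax(1).tolist()

    # Stop at the first EOS token
    if EOS_token in pred_indices:
        pred_indices = pred_indices[:pred_indices.index(EOS_token)]
    pred_sentence = ' '.join([output_lang.index2word[i] for i in pred_indices])

    return pred_sentence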

#We can evaluate random sentences from the training set and print out the input, target, and output to make some subjective quality judgements
def evaluateRandomly(model, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_sentence = evaluate(model, pair[0])
        print('<', output_sentence)
        print('')

## Training and Evaluating

For this dataset we can use a relatively small network with 256-dimensional embeddings, 512 hidden units and a
single GRU layer. When training from scratch, after about 40 minutes on a MacBook CPU we'll get some
reasonable results.



In [33]:
decoder = Decoder(output_lang.n_words, 256, 512)
encoder = Encoder(input_lang.n_words, 256, 512)
model = Seq2Seq(encoder, decoder).to(device)
losses = trainIters(model, cfg)
### TODO: SHOW NOTEBOOK OUTPUT ###

0m 3s (- 6m 25s) (10 1%) -0.0372
0m 4s (- 3m 16s) (20 2%) -0.0925
0m 4s (- 2m 13s) (30 3%) -0.1103
0m 4s (- 1m 41s) (40 4%) -0.1463
0m 4s (- 1m 23s) (50 5%) -0.1704
0m 4s (- 1m 10s) (60 6%) -0.2668
0m 4s (- 1m 1s) (70 7%) -0.3313
0m 4s (- 0m 54s) (80 8%) -0.3898
0m 4s (- 0m 49s) (90 9%) -0.5437
0m 4s (- 0m 44s) (100 10%) -0.4431
0m 5s (- 0m 41s) (110 11%) -0.6407
0m 5s (- 0m 38s) (120 12%) -0.8393
0m 5s (- 0m 35s) (130 13%) -1.0080
0m 5s (- 0m 33s) (140 14%) -1.3179
0m 5s (- 0m 31s) (150 15%) -1.4990
0m 5s (- 0m 30s) (160 16%) -1.6857
0m 5s (- 0m 28s) (170 17%) -1.9657
0m 6s (- 0m 27s) (180 18%) -2.4881
0m 6s (- 0m 26s) (190 19%) -2.7084
0m 6s (- 0m 25s) (200 20%) -3.3129
0m 6s (- 0m 24s) (210 21%) -4.2730
0m 6s (- 0m 23s) (220 22%) -3.6393
0m 6s (- 0m 23s) (230 23%) -5.2942
0m 7s (- 0m 23s) (240 24%) -5.4284
0m 7s (- 0m 22s) (250 25%) -6.7103
0m 7s (- 0m 21s) (260 26%) -5.8332
0m 8s (- 0m 21s) (270 27%) -7.5440
0m 8s (- 0m 21s) (280 28%) -8.3572
0m 8s (- 0m 20s) (290 28%) -9.9175
0m 8

In [34]:
#TODO: print outcome of trained model

evaluateRandomly(model, 5)

> tom mag basketball sehr gern .
= tom really likes basketball .
< SOS EOS EOS EOS EOS EOS EOS EOS

> kann ich mich auf dich verlassen ?
= can i rely on you ?
< SOS EOS EOS EOS EOS EOS EOS EOS EOS

> tom war sichtlich verargert .
= tom was visibly upset .
< SOS EOS EOS EOS EOS EOS EOS

> sie machte vor freude einen luftsprung als sie die nachricht horte .
= she jumped for joy the moment she heard the news .
< SOS EOS EOS EOS EOS EOS EOS EOS EOS EOS EOS EOS EOS EOS

> meine stimme bekommt tom ganz sicher nicht !
= tom certainly won t get my vote .
< SOS EOS EOS EOS EOS EOS EOS EOS EOS EOS



# Transfer learning
## Fine-tune an out-of-the-box encoder-decoder model

In [36]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
import torch

In [37]:
"""
TODO: Task 5 (15 pt)

Load a transformer (seq2seq model using attention layers) and fine-tune it to your problem.
Show the training progress of the model and the training metrics. Briefly
explain (3-4 sentences) the model choice and comment on the outcomes.

Note: you don't have to use the above-defined methods for training.
Splitting the dataset in train and test dataset is welcomed, but not required for the task.

"""

# Dataset Class
class Seq2SeqDataset(Dataset):
    def __init__(self, pairs, tokenizer, source_lang="ENG", target_lang="GER", max_len=128):
        self.pairs = pairs  # DataFrame containing source and target sentences
        self.tokenizer = tokenizer
        self.source_lang = source_lang
        self.target_lang = target_lang
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        source_sentence = self.pairs.iloc[idx][self.source_lang]
        target_sentence = self.pairs.iloc[idx][self.target_lang]

        # Add task-specific prefix (required for T5)
        source_sentence = f"translate English to German: {source_sentence}"

        source_encodings = self.tokenizer(
            source_sentence, max_length=self.max_len, padding="max_length", truncation=True, return_tensors="pt"
        )
        target_encodings = self.tokenizer(
            target_sentence, max_length=self.max_len, padding="max_length", truncation=True, return_tensors="pt"
        )

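        labels = target_encodings["input_ids"].squeeze()
        # Replace padding with -100 so that padded positions are ignored by the loss
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            "input_ids": source_encodings["input_ids"].squeeze(),
            "attention_mask": source_encodings["attention_mask"].squeeze(),
            "labels": labels,
        }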

In [38]:
# Load Pre-trained Model and Tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)


In [39]:
DF_f.shape

(25562, 2)

In [40]:
# Randomly sample 25% of the dataset
DF_f_subset = DF_f.sample(frac=0.25, random_state=42)
DF_f_subset.shape

(6390, 2)

I tried to train the model on the entire dataset (DF_f), which has approximately 25,500 data points, but training was extremely slow. To address this, I trained the model on a smaller subset of the data. While this makes training faster, it also increases the risk of overfitting, as a model trained on a smaller dataset tends to generalise less well.
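
One way to check for the overfitting mentioned above is to hold out part of the subset as a test set. The sketch below does this on plain Python lists; the function name, the 10% test fraction and the seed are assumptions for illustration, and the notebook itself samples with `DataFrame.sample` and does not create a test split.

```python
import random

def split_subset(rows, frac=0.25, test_frac=0.1, seed=42):
    """Sample `frac` of the rows reproducibly, then split the sample into train and test."""
    rng = random.Random(seed)
    subset = rng.sample(rows, k=round(len(rows) * frac))
    n_test = round(len(subset) * test_frac)
    return subset[n_test:], subset[:n_test]

train_rows, test_rows = split_subset(list(range(1000)))
print(len(train_rows), len(test_rows))  # 225 25
```

Comparing the loss on `train_rows` and `test_rows` after each epoch would show whether the model is memorising the subset.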

In [41]:
# Prepare Dataset and DataLoader
train_dataset = Seq2SeqDataset(DF_f_subset, tokenizer)  # DF_f_subset is the 25% sample of the filtered DataFrame DF_f
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)


In [42]:
# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)



In [None]:
# Training Loop
epochs = 2
for epoch in range(epochs):
    model.train()
    epoch_loss = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    print(f"Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / len(train_loader):.4f}")



Epoch 1/2, Loss: 0.3853


In [None]:
# Testing the Model
model.eval()
test_sentence = "translate English to German: I ultimately finished the assignment."
input_ids = tokenizer(test_sentence, return_tensors="pt", truncation=True, max_length=128).input_ids.to(device)
output_ids = model.generate(input_ids)
print("Generated Translation:", tokenizer.decode(output_ids[0], skip_special_tokens=True))



Generated Translation: ich habe die aufgabe letztlich beendet.


# Credits

This problem set is based upon an official PyTorch [tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html). Many thanks to PyTorch, [Sean Robertson](https://github.com/spro/practical-pytorch) and [Florian Nachtigall](https://github.com/FlorianNachtigall).

Be cautious about looking in the original notebook for answers. Many details have been changed and you won't be able to copy-and-paste solutions.
