<a href="https://colab.research.google.com/github/BigTMiami/colabtest/blob/main/Colab_Machine_Translation_LSTM_TUNING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


*Machine Translation Jupyter Notebook.  (c) 2021 Georgia Tech*

*Copyright 2021, Georgia Institute of Technology (Georgia Tech) <br>Atlanta, Georgia 30332<br>All Rights Reserved*

*Georgia Tech asserts copyright ownership of this template and all derivative works, including solutions to the projects assigned in this course. Students and other users of this template code are advised not to share it with others or to make it available on publicly viewable websites including repositories such as Github, Bitbucket, and Gitlab.  This copyright statement should not be removed or edited.*

*Sharing solutions with current or future students of CS 7643 Deep Learning is prohibited and subject to being investigated as a GT honor code violation.*

*DO NOT EDIT ANYTHING ABOVE THIS LINE*

# Machine Translation with Seq2Seq and Transformers
In this exercise you will implement a [Sequence to Sequence (Seq2Seq)](https://arxiv.org/abs/1703.03906) model and a [Transformer](https://arxiv.org/pdf/1706.03762.pdf) model and use them to perform machine translation.

**A quick note: if you receive the following TypeError "super(type, obj): obj must be an instance or subtype of type", try re-importing that part or restarting your kernel and re-running all cells.** Once you have finished making changes to the model constructor, you can avoid this issue by commenting out all of the model instantiations after the first (e.g. lines starting with `model = TransformerTranslator(*args, **kwargs)`).

# **1: Introduction**

## Multi30K: Multilingual English-German Image Descriptions

[Multi30K](https://github.com/multi30k/dataset) is a dataset for machine translation tasks. It is a multilingual corpus containing English sentences and their German translations. In total it contains 31,014 sentences (29,000 for training, 1,014 for validation, and 1,000 for testing).
For example:

En: `Two young, White males are outside near many bushes.`

De: `Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.`

You can read more info about the dataset [here](https://arxiv.org/abs/1605.00459). The following parts of this assignment will be based on this dataset.

## TorchText: A PyTorch Toolkit for Text Dataset and NLP Tasks
[TorchText](https://github.com/pytorch/text) is a PyTorch package that consists of data processing utilities and popular datasets for natural language. These utilities help with data splitting and loading, token encoding, sequence padding, etc. You don't need to know how TorchText works in detail, but you should understand why those classes are needed and which operations are necessary for machine translation. This knowledge carries over to all sequential data modeling. In the following parts, we will provide you with some code to help you understand.

You can refer to torchtext's documentation (v0.9.0) [here](https://pytorch.org/text/).

## Spacy
Spacy is a package designed for tokenization in many languages. Tokenization is the process of splitting raw text into lists of tokens that can be further processed. Since TorchText only provides a tokenizer for English, we will use Spacy for this assignment.
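
As a rough illustration of what a tokenizer produces, here is a minimal sketch that uses a regular expression as a stand-in for Spacy. The `simple_tokenize` helper is hypothetical and not part of the assignment; Spacy's real tokenizers apply language-specific rules and may split some text differently.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.

    This is only a rough stand-in for a Spacy tokenizer (an assumption
    made for illustration), not the tokenizer used in the assignment.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Two young, White males are outside near many bushes.")
print(tokens)
```

Each sentence becomes a list of strings, which is the form the vocabulary-building and encoding steps later in this notebook expect.
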


**Notice: For the following assignment, we strongly recommend working in a virtual Python environment. We recommend Anaconda, a powerful environment management tool. You can download it [here](https://www.anaconda.com/products/individual)**.

## **1.1: Prerequisites**
Before you start this assignment, you need to have all required packages installed either on the terminal you are using, or in the virtual environment. Please make sure you have the following packages installed:

`PyTorch, TorchText, Spacy, Tqdm, Numpy`

You can first check using either `pip freeze` in a terminal or `conda list` in a conda environment. Then run the following code block to make sure they can be imported.

In [None]:
%pip install torchtext torch  spacy tqdm numpy jupyter notebook

In [None]:
from google.colab import drive
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/assignment4_spring24'

In [None]:
import os
# add for trace error improvement
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [None]:
# Just run this block. Please do not modify the following code.
import math
import time
import io
import numpy as np
import csv
from IPython.display import Image

# Pytorch package
import torch
import torch.nn as nn
import torch.optim as optim

# Torchtext package
import torchtext
from torchtext.datasets import Multi30k
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import vocab
from torchtext.utils import download_from_url, extract_archive
from torch.nn.utils.rnn import pad_sequence

# Tqdm progress bar
from tqdm import tqdm_notebook, tqdm

# Code provided to you for training and evaluation
from utils import train, evaluate, set_seed_nb, unit_test_values, deterministic_init, plot_curves

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

Once you have imported the above packages, you can download the Spacy English and German tokenizers by running the following commands in your **terminal** (or by running the notebook cell below). They will take some time.

`python -m spacy download en_core_web_sm`

`python -m spacy download de_core_news_sm`

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

Now let's check your GPU availability and load some sanity checkers. By default you should use your GPU for this assignment if one is available.

In [None]:
# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("You are using device: %s" % device)

In [None]:
# load checkers
d1 = torch.load('./data/d1.pt')
d2 = torch.load('./data/d2.pt')
d3 = torch.load('./data/d3.pt')
d4 = torch.load('./data/d4.pt')

## **1.2: Preprocess Data**
With the TorchText and Spacy tokenizers ready, you can now prepare the data using *TorchText* objects. Just run the following code blocks. Read the comments and try to understand what each block is for.

In [None]:
MAX_LEN = 20
url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
train_urls = ('train.de.gz', 'train.en.gz')
val_urls = ('val.de.gz', 'val.en.gz')
test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')

train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

de_tokenizer = get_tokenizer('spacy', language='de_core_news_sm')
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

In [None]:
def build_vocab(filepath, tokenizer):
  counter = Counter()
  with io.open(filepath, encoding="utf8") as f:
    for string_ in f:
      counter.update(tokenizer(string_.lower()))
  return vocab(counter, specials=['<unk>', '<pad>', '<sos>', '<eos>'], min_freq=2)


de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)
de_vocab.set_default_index(de_vocab['<unk>'])
en_vocab.set_default_index(en_vocab['<unk>'])
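
To make the vocabulary step concrete, the following sketch approximates in plain Python what `build_vocab` and `set_default_index` achieve: tokens seen fewer than `min_freq` times are dropped, the special tokens come first, and unknown tokens map to `<unk>`. The helpers `build_toy_vocab` and `encode` are hypothetical, whitespace splitting stands in for the Spacy tokenizer, and the ordering of regular tokens in the real TorchText vocabulary may differ.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<sos>', '<eos>']

def build_toy_vocab(sentences, min_freq=2):
    """Map each sufficiently frequent token to an integer index."""
    counter = Counter()
    for sentence in sentences:
        counter.update(sentence.lower().split())
    # Special tokens take the first indices; rare tokens are dropped.
    kept = [tok for tok, freq in counter.items() if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}

def encode(tokens, stoi):
    """Convert tokens to indices, falling back to <unk> for unseen tokens."""
    unk_index = stoi['<unk>']
    return [stoi.get(tok, unk_index) for tok in tokens]

toy_vocab = build_toy_vocab(["a dog runs", "a dog sleeps", "a cat runs"])
encoded = encode("a bird runs".split(), toy_vocab)
print(toy_vocab)
print(encoded)
```

In this toy corpus, "sleeps" and "cat" appear only once, so they are excluded, and the unseen word "bird" is encoded as the `<unk>` index.
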

In [None]:
def data_process(filepaths):
  raw_de_iter = iter(io.open(filepaths[0], encoding="utf8"))
  raw_en_iter = iter(io.open(filepaths[1], encoding="utf8"))
  data = []
  for (raw_de, raw_en) in zip(raw_de_iter, raw_en_iter):
    raw_en_l=raw_en.lower()     #turn sentences to lower case
    raw_de_l=raw_de.lower()
    de_tensor = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de_l)],
                            dtype=torch.long)
    en_tensor = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en_l)],
                            dtype=torch.long)
    if len(de_tensor) <= MAX_LEN-2 and len(en_tensor) <= MAX_LEN-2:
        data.append((de_tensor, en_tensor))
  return data

In [None]:
train_data = data_process(train_filepaths)
val_data = data_process(val_filepaths)
test_data = data_process(test_filepaths)

In [None]:
PAD_IDX = de_vocab['<pad>']
SOS_IDX = de_vocab['<sos>']
EOS_IDX = de_vocab['<eos>']

In [None]:
def generate_batch(data_batch):

    de_batch, en_batch = [], []
    for (de_item, en_item) in data_batch:
          en_batch.append(torch.cat([torch.tensor([SOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
          de_batch.append(torch.cat([torch.tensor([SOS_IDX]), de_item, torch.tensor([EOS_IDX])], dim=0))
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
    de_batch = pad_sequence(de_batch, padding_value=PAD_IDX)
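    # Pad both batches to a fixed length of MAX_LEN: stacking them with a dummy
    # (MAX_LEN, batch) tensor makes pad_sequence extend every entry to MAX_LEN rows.
    fix = torch.ones(MAX_LEN, en_batch.shape[1])
    two = pad_sequence([de_batch, en_batch, fix], padding_value=PAD_IDX)
    de_batch = two[:, 0, ]
    en_batch = two[:, 1, ]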
    return de_batch, en_batch
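
The effect of `generate_batch` on a single sentence can be sketched with plain lists: wrap the token indices in `<sos>` and `<eos>`, then pad to `MAX_LEN`. The index values below (1, 2, 3) are assumptions for illustration, and `pad_batch` is a hypothetical helper. Note that this sketch returns one padded list per sentence, while `generate_batch` returns tensors of shape `(MAX_LEN, batch_size)`.

```python
MAX_LEN = 20
# Assumed special-token indices, for illustration only.
PAD_IDX, SOS_IDX, EOS_IDX = 1, 2, 3

def pad_batch(sequences, max_len=MAX_LEN):
    """Frame each sequence with <sos>/<eos> and right-pad it to max_len."""
    padded = []
    for seq in sequences:
        framed = [SOS_IDX] + list(seq) + [EOS_IDX]
        padded.append(framed + [PAD_IDX] * (max_len - len(framed)))
    return padded

batch = pad_batch([[5, 6, 7], [8, 9]])
for row in batch:
    print(row)
```

Because `data_process` keeps only sentences with at most `MAX_LEN - 2` tokens, the framed sequence always fits within `MAX_LEN` positions.
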

In [None]:
# Get the input and the output sizes for model
input_size = len(de_vocab)
output_size = len(en_vocab)
print (input_size,output_size)

# **3: Train a Seq2Seq Model**
In this section, you will be working on implementing a simple Seq2Seq model. You will first implement an Encoder and a Decoder, and then join them together with a Seq2Seq architecture. You will need to complete the code in *Decoder.py*, *Encoder.py*, and *Seq2Seq.py* under *seq2seq* folder. Please refer to the instructions in those files.

## **3.3: Implement the Seq2Seq**
In this section you will be implementing the Seq2Seq model that utilizes the Encoder and Decoder you implemented. Please refer to the instructions in *seq2seq/Seq2Seq.py*. Run the following block to check your implementation.

In [None]:
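import json
import os
from datetime import datetime
def save_experiment_results(filename, results_dir="results/", **kwargs):
    """Write the keyword arguments, plus a timestamp, to <results_dir>/<filename>.json."""
    kwargs["save_time"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    os.makedirs(results_dir, exist_ok=True)  # create the results folder if it is missing
    with open(os.path.join(results_dir, filename + ".json"), 'w') as f:
        json.dump(kwargs, f, indent=4)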

In [None]:
def train_and_plot (model, optimizer, scheduler, criterion, filename, EPOCHS, train_loader,valid_loader):
  train_perplexity_history = []
  valid_perplexity_history = []

  for epoch_idx in range(EPOCHS):
      print("-----------------------------------")
      print("Epoch %d" % (epoch_idx+1))
      print("-----------------------------------")

      train_loss, avg_train_loss = train(model, train_loader, optimizer, criterion, device=device)
      scheduler.step(train_loss)

      val_loss, avg_val_loss = evaluate(model, valid_loader, criterion, device=device)

      train_perplexity_history.append(np.exp(avg_train_loss))
      valid_perplexity_history.append(np.exp(avg_val_loss))

      print("Training Loss: %.4f. Validation Loss: %.4f. " % (avg_train_loss, avg_val_loss))
      print("Training Perplexity: %.4f. Validation Perplexity: %.4f. " % (np.exp(avg_train_loss), np.exp(avg_val_loss)))

  plot_curves(train_perplexity_history, valid_perplexity_history, filename)

  return avg_train_loss, avg_val_loss, np.exp(avg_train_loss), np.exp(avg_val_loss)
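
`train_and_plot` reports perplexity as the exponential of the average loss. The short sketch below illustrates that relationship, assuming the average loss is a mean cross-entropy measured in nats (the natural-log convention used by `nn.CrossEntropyLoss`). The `perplexity` helper and the vocabulary size of 1000 are hypothetical values chosen for illustration.

```python
import math

def perplexity(avg_loss):
    """Perplexity is the exponential of the average cross-entropy loss."""
    return math.exp(avg_loss)

# A model that guesses uniformly over V tokens has cross-entropy ln(V),
# so its perplexity is approximately V.
vocab_size = 1000  # hypothetical vocabulary size
uniform_loss = math.log(vocab_size)
print(perplexity(uniform_loss))

# A loss of 0 (perfect prediction) corresponds to a perplexity of 1.
print(perplexity(0.0))
```

Read this way, a perplexity of N roughly means the model is as uncertain as if it were choosing uniformly among N tokens at each step.
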

In [None]:
def create_token_string(word_tensor, example_count=3, vocab=en_vocab):
    token_strings = []
    for i in range(example_count):
        target_words = vocab.lookup_tokens(word_tensor[i].tolist())
        translation_string = ""
        for word in target_words:
            if word in ['<sos>', '<eos>', '<pad>', '\n']:
                continue
            translation_string += word + " "
        token_strings.append(translation_string)
    return  token_strings



def create_translation_string(model, dataloader, example_count=3 ):
    with torch.no_grad():
        data = next(iter(dataloader))
        source = data[0].transpose(1, 0).to(device)
        target = data[1].transpose(1, 0).to(device)
        batch_size, seq_len = target.shape
        translation = model(source)
        translation = translation.reshape(batch_size * seq_len, translation.shape[-1])
        translation = torch.argmax(translation, dim=1)
        translation = translation.reshape(batch_size,seq_len)

    target_examples = create_token_string(target, example_count)
    translation_examples = create_token_string(translation, example_count)

    return target_examples, translation_examples
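
The core of `create_translation_string` is a greedy choice: at each position, take the highest-scoring vocabulary entry, then drop the special tokens. Here is a minimal sketch of that idea with plain lists standing in for tensors. The `greedy_decode` helper, the toy vocabulary, and the score values are all made up for illustration.

```python
SPECIAL_TOKENS = {'<sos>', '<eos>', '<pad>'}

def greedy_decode(scores, itos):
    """Pick the highest-scoring token at each position and skip special tokens."""
    words = []
    for step_scores in scores:
        best_index = max(range(len(step_scores)), key=lambda i: step_scores[i])
        token = itos[best_index]
        if token not in SPECIAL_TOKENS:
            words.append(token)
    return " ".join(words)

# Toy vocabulary and per-position scores (illustrative values only).
itos = ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'dog', 'runs']
scores = [
    [0.1, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0],  # argmax -> <sos>
    [0.0, 0.0, 0.0, 0.0, 2.5, 0.3, 0.0],  # argmax -> a
    [0.0, 0.0, 0.0, 0.0, 0.2, 4.0, 0.0],  # argmax -> dog
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.1],  # argmax -> <eos>
]
decoded = greedy_decode(scores, itos)
print(decoded)
```
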

In [None]:
%autoreload 2
from models.seq2seq.Encoder import Encoder
from models.seq2seq.Decoder import Decoder
from models.seq2seq.Seq2Seq import Seq2Seq


In [None]:
def run_training(**kwargs):
  # Read each hyperparameter from kwargs, falling back to its default value.
  encoder_emb_size = kwargs.get("encoder_emb_size", 128)
  encoder_hidden_size = kwargs.get("encoder_hidden_size", 128)
  encoder_dropout = kwargs.get("encoder_dropout", 0.2)

  decoder_emb_size = kwargs.get("decoder_emb_size", 128)
  decoder_hidden_size = kwargs.get("decoder_hidden_size", 128)
  decoder_dropout = kwargs.get("decoder_dropout", 0.2)

  learning_rate = kwargs.get("learning_rate", 0.001)
  model_type = kwargs.get("model_type", "LSTM")
  EPOCHS = kwargs.get("EPOCHS", 20)

  attention = kwargs.get("attention", True)
  BATCH_SIZE = kwargs.get("BATCH_SIZE", 128)

  train_loader = DataLoader(train_data, batch_size=BATCH_SIZE,
                          shuffle=False, collate_fn=generate_batch)
  valid_loader = DataLoader(val_data, batch_size=BATCH_SIZE,
                          shuffle=False, collate_fn=generate_batch)
  test_loader = DataLoader(test_data, batch_size=BATCH_SIZE,
                        shuffle=False, collate_fn=generate_batch)

  #input size and output size
  input_size = len(de_vocab)
  output_size = len(en_vocab)

  # Declare models, optimizer, and loss function
  set_seed_nb()
  encoder = Encoder(input_size, encoder_emb_size, encoder_hidden_size, decoder_hidden_size, dropout = encoder_dropout, model_type = model_type)
  decoder = Decoder(decoder_emb_size, encoder_hidden_size, encoder_hidden_size, output_size, dropout = decoder_dropout, model_type = model_type)
  seq2seq_model = Seq2Seq(encoder, decoder, device, attention=attention)

  optimizer = optim.Adam(seq2seq_model.parameters(), lr = learning_rate)
  scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
  criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

  filename = 'lstm_tuned_batch'
  for key,value in kwargs.items():
    filename += f"_{key}_{value}".replace('.','_')

  print(f"EPOCHS:{EPOCHS} filename:{filename}")

  train_loss, val_loss, train_perplexity, val_perplexity = train_and_plot(seq2seq_model, optimizer, scheduler, criterion, filename, EPOCHS,train_loader,valid_loader)
  target_examples, translation_examples = create_translation_string(seq2seq_model, test_loader)
  save_experiment_results(filename, train_loss=train_loss, val_loss=val_loss, train_perplexity=train_perplexity, val_perplexity=val_perplexity,
                          input_size=input_size, output_size=output_size,
                          encoder_emb_size=encoder_emb_size, encoder_hidden_size=encoder_hidden_size, encoder_dropout=encoder_dropout,
                          decoder_emb_size=decoder_emb_size, decoder_hidden_size=decoder_hidden_size, decoder_dropout=decoder_dropout,
                          learning_rate=learning_rate, model_type=model_type, EPOCHS=EPOCHS, attention=attention,
                          target_examples=target_examples, translation_examples=translation_examples, BATCH_SIZE=BATCH_SIZE)
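
Because `run_training` looks up each setting by name and falls back to a default, a misspelled keyword is silently ignored and the default is used instead. One possible safeguard, sketched below, is to validate the keyword names against the known defaults before training. The `DEFAULTS` dictionary and the `resolve_config` helper are hypothetical and not part of the provided code; the default values mirror those in `run_training`.

```python
DEFAULTS = {
    "encoder_emb_size": 128, "encoder_hidden_size": 128, "encoder_dropout": 0.2,
    "decoder_emb_size": 128, "decoder_hidden_size": 128, "decoder_dropout": 0.2,
    "learning_rate": 0.001, "model_type": "LSTM", "EPOCHS": 20,
    "attention": True, "BATCH_SIZE": 128,
}

def resolve_config(**overrides):
    """Merge overrides into the defaults, rejecting unknown keyword names."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise TypeError(f"Unknown hyperparameter(s): {unknown}")
    return {**DEFAULTS, **overrides}

config = resolve_config(decoder_hidden_size=64, learning_rate=0.003)
print(config["decoder_hidden_size"], config["learning_rate"], config["EPOCHS"])

try:
    resolve_config(EPOCH=80)  # misspelled: the expected key is "EPOCHS"
except TypeError as err:
    print(err)
```

With a check like this, a typo in a hyperparameter name fails immediately instead of producing a run with unintended settings.
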

In [None]:
# run_training(encoder_dropout=0.1)
# run_training(encoder_dropout=0.4)
# run_training(decoder_dropout=0.1)
# run_training(decoder_dropout=0.4)
# run_training(encoder_emb_size=64)
# run_training(encoder_emb_size=256)
# run_training(decoder_emb_size=64)
# run_training(decoder_emb_size=256)
run_training(decoder_hidden_size=64)
run_training(decoder_hidden_size=256)
run_training(encoder_emb_size=256,decoder_emb_size=64,decoder_dropout=0.3)
run_training(EPOCHS=80,BATCH_SIZE=256,learning_rate=0.003)
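
Since each run writes its settings and metrics to a JSON file, the runs can be compared afterwards by loading those files and sorting on a metric. The sketch below shows one way this might look. `load_results` is a hypothetical helper, and the demo writes placeholder numbers to a temporary folder, so the values are not real experiment results.

```python
import json
import os
import tempfile

def load_results(results_dir, sort_key="val_perplexity"):
    """Load every .json record in results_dir, sorted by sort_key (lowest first)."""
    records = []
    for name in sorted(os.listdir(results_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(results_dir, name)) as f:
            record = json.load(f)
        record["filename"] = name
        records.append(record)
    return sorted(records, key=lambda r: r[sort_key])

# Demo with placeholder numbers (NOT real experiment results).
with tempfile.TemporaryDirectory() as demo_dir:
    for run_name, ppl in [("run_a", 30.0), ("run_b", 25.0)]:
        with open(os.path.join(demo_dir, run_name + ".json"), "w") as f:
            json.dump({"val_perplexity": ppl}, f)
    ranked = load_results(demo_dir)
    print([r["filename"] for r in ranked])
```

In the notebook itself, pointing `load_results` at the `results/` folder would list the tuning runs from lowest to highest validation perplexity, assuming every file there contains a `val_perplexity` field.
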

In [None]:
from google.colab import runtime
runtime.unassign()