## Check what GPU you got
Click the Runtime dropdown at the top of the page, then Change Runtime Type and confirm the instance type is GPU.

In [1]:
!nvidia-smi

Thu Apr 30 04:34:25 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

## Pre-requisites
Connect to Google Drive to get the Dataset and code



In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
Collecting fr_core_news_sm==2.2.5
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.2.5/fr_core_news_sm-2.2.5.tar.gz (14.7MB)
[K     |████████████████████████████████| 14.7MB 762kB/s 
Building wheels for collected packages: fr-core-news-sm
  Building wheel for fr-core-news-sm (setup.py) ... [?25l[?25

### **Important: Reset Runtime**

Note: there is a slight bug with Google Colab. After installing Spacy, you need to restart the Jupyter Notebook runtime.

There are two ways:
1. Click on the Runtime dropdown, and select "Restart Runtime". Once that is done, proceed to the next step (no need to remount the drive).
2. Run the code below. It will kill the current process, effectively restarting the runtime.

In [0]:
import os
os.kill(os.getpid(), 9)

##Get the data

### **Get a list of vocabs:**
The list of vocabs are already stored in the Google Drive folder; thus, we just have to load it.

In [1]:
import sys
sys.path.append('/content/drive/My Drive/English-to-French-Translation/src/dataloader/')

import vocabs

def combine_vocabs(vocab1, vocab2):
    combined_words = set()

    for key in vocab1.get_word2id():
        combined_words.add(key)

    for key in vocab2.get_word2id():
        combined_words.add(key)

    word2id = dict((word, index) for index, word in enumerate(combined_words))
    id2word = dict((index, word) for index, word in enumerate(combined_words))

    print('Built', len(word2id), 'vocabs')

    return vocabs.VocabDataset(word2id, id2word)

models_dir = "/content/drive/My Drive/English-to-French-Translation/models/Hansard/"
french_vocabs = vocabs.load_vocabs_from_file(models_dir + 'vocab.french.gz')
english_vocabs = vocabs.load_vocabs_from_file(models_dir + 'vocab.english.gz')

combined_vocabs = combine_vocabs(english_vocabs, french_vocabs)
combined_vocabs = sorted([word for word in combined_vocabs.get_word2id()])


Loaded 33639 words
Loaded 25370 words
Built 52275 vocabs


### **Tokenize and split the dataset**
We will have three types of datasets:
1. Training data: it is the data used to train our model
2. Validation (val) data: it is the data used to test our model at each step of the training process
3. Test data: it is the data used to test our model after all the training is done

How to get the three types of data?
* The test set is already in the `Test` folder
* The validation set is a piece of the data in the `Train`` folder

The code to get our three types of datasets is:

In [8]:
import sys
sys.path.append('/content/drive/My Drive/English-to-French-Translation/src/dataloader')

import os
from tqdm.notebook import tqdm

import torch
import utils
import vocabs

class Seq2VecDataset(torch.utils.data.Dataset):
    def __init__(self, dir_: str, vocabs: list, langs: list):
        ''' Reads in the sentences in `dir_`, convert each word into its numerical token, 
            and tags it with its language

            Parameters
            ----------
            dir_ : str
                The directory with the training data
            vocabs : [ str ]
                A list of vocabs
            lang : [ str ]
                The set of languages to capture in `dir_`
        '''

        pairs = []
        word2index = { word: index for index, word in enumerate(vocabs)}
        
        for lang_index, lang in enumerate(langs):

            # Get the spacy instance
            spacy_instance = utils.get_spacy_instance(lang)

            # Get all the filenames with that language
            transcriptions = utils.get_parallel_text(dir_, [lang])
            
            # Get all the filepaths with that language
            filepaths = [os.path.join(dir_, trans[0]) for trans in transcriptions]

            # Get the iterator that will read and tokenize all the content in filepaths
            iterator = utils.read_transcription_files(filepaths, spacy_instance)

            # Get the number of sentences in entire corpus with that language
            corpus_size = utils.get_size_of_corpus(filepaths)

            for (f, f_fn, _), in tqdm(iterable=zip(iterator), total=corpus_size):
                if not f:
                    continue

                # Ignore sentences with no words in vocabs
                has_known_word = sum([1 if word in word2index else 0 for word in f]) > 0
                if not has_known_word:
                    continue

                pairs.append((f, lang_index))
        
        self.langs = langs
        self.vocabs = vocabs
        self.pairs = pairs
        self.word2index = word2index
    
    def __len__(self):
        ''' Returns the number of sentences in this dataset '''
        return len(self.pairs)

    def __getitem__(self, i):
        ''' Returns the i-th sentence in this dataset '''
        f, lang_index = self.pairs[i]

        # Get the bag of words for f where vec[i] = 1 if word i exists; else 0
        F = torch.zeros(len(self.vocabs))
        for word in f:
            if word in self.word2index:
                index = self.word2index[word]
                F[index] = 1

        Y = torch.tensor(lang_index)

        return F, Y

utils.get_spacy_instance('fr')
train_dir = "/content/drive/My Drive/English-to-French-Translation/data/Hansard/Training"
test_dir = "/content/drive/My Drive/English-to-French-Translation/data/Hansard/Testing"

num_epochs = 10
batch_size = 32
device = torch.device('cuda')
train_test_split_ratio = 0.75

# dataset = Seq2VecDataset(train_dir, combined_vocabs, ['en', 'fr'])

# num_training_data = int(len(dataset) * train_test_split_ratio)
# num_val_data = len(dataset) - num_training_data

# train_dataset, val_dataset = torch.utils.data.random_split(
#   dataset, [num_training_data, num_val_data]
# )

# train_dataloader = torch.utils.data.DataLoader(
#   train_dataset, 
#   batch_size=batch_size, 
#   shuffle=True,
#   pin_memory=(device.type == 'cuda'),
#   num_workers=4
# )
# val_dataloader = torch.utils.data.DataLoader(
#   val_dataset, 
#   batch_size=batch_size, 
#   shuffle=True,
#   pin_memory=(device.type == 'cuda'),
#   num_workers=4
# )

test_dataset = Seq2VecDataset(test_dir, combined_vocabs, ['en', 'fr'])
test_dataloader = torch.utils.data.DataLoader(
  test_dataset, 
  batch_size=batch_size, 
  shuffle=True,
  pin_memory=(device.type == 'cuda'),
  num_workers=4
)

HBox(children=(IntProgress(value=0, max=256058), HTML(value='')))




HBox(children=(IntProgress(value=0, max=256058), HTML(value='')))




## Build the Classifier

In [0]:
import torch
from torch import nn
import torch.nn.functional as F

class Seq2VecNN(nn.Module):
    def __init__(self, vocab_size, num_classes, num_neurons_per_layer=[1000, 1000, 1000]):

        super().__init__()

        self.vocab_size = vocab_size
        self.num_classes = num_classes

        layers = []

        prev_layer_count = vocab_size
        for num_neurons in num_neurons_per_layer:
            layers.append(nn.Linear(prev_layer_count, num_neurons))
            layers.append(nn.ReLU())
            prev_layer_count = num_neurons

        self.feedforward_layer = nn.Sequential(*layers)

        self.output_layer = nn.Linear(prev_layer_count, self.num_classes)
        
    def forward(self, F):
        ''' Given a batch of sequences, and its sequence lengths, output a softmax of
            its class

            Parameters
            ----------
            F : torch.LongTensor (N, self.vocab_size)
                It is a batch of bag-of-words

            Returns
            -------
            logits_t : torch.FloatTensor (N, self.vocab_size)
                It is a un-normalized distribution over the classes for the n-th sequence:
                Pr_b(i) = softmax(logits_t[i]) for i in self.num_classes
        '''
        x = self.feedforward_layer(F)
        return self.output_layer(x)
        

## **Training:**
We are going to train our model and see if for each epoch it improves its predictions on the validation set

How we train it:
* It uses teacher forcing:
  1. First, we have the source and target sentences
  2. Then, we feed the source sentence into the Encoder. The encoder returns the attended source sentence.
  3. Next, we feed the attended source sentence and the target sentence into the decoder
  4. We check if the output of the decoder is the same as the target sentence

How we make predictions:
* It is similar to RNNs, where we feed in the source input to the encoder, feed in an SOS in the decoder, and we take the outputs of the previous decoder as inputs to the next decoder
* In more detail:
  1. First, we have the source sentence
  2. Then, we feed the source sentence into the Encoder to get an attended version of the source sentence
  3. Next, we feed an SOS token and the attended source sentence in the Decoder as the first input to our decoder
  4. We get the outputs of the decoder and use it as the next token to feed as the second input to our decoder
  5. Repeat 3-4 until we get an EOS token

### **How to train our model for one epoch:**

In [0]:
import torch
from torch import nn
import torch.nn.functional as F

def train_for_one_epoch(model, loss_function, optimizer, train_dataloader, device):
  model.train()

  train_loss = 0.0
  train_accuracy = 0.0

  for F, Y in tqdm(train_dataloader, total=len(train_dataloader)):

    # Send data to device
    F = F.to(device)
    Y = Y.to(device)   

    # Forward-prop with the model
    optimizer.zero_grad()
    logits = model(F)
    
    # Compute the loss
    batch_loss = loss_function(logits, Y)
    train_loss += batch_loss.item()

    # Compute the accuracy
    _, predictions = torch.max(torch.round(torch.sigmoid(logits)), 1)    
    batch_accuracy = predictions.eq(Y).sum().float().item() / Y.shape[0]
    train_accuracy += batch_accuracy
    
    batch_loss.backward()
    optimizer.step()

    del F, Y

  train_loss /= len(train_dataloader)
  train_accuracy /= len(train_dataloader)

  return train_loss, train_accuracy


### **How to evaluate our model:**

In [0]:
def evaluate_model(model, loss_function, test_dataloader, device):

  model.eval()

  eval_loss = 0
  eval_accuracy = 0
  num_sequences = 0

  for F, Y in tqdm(test_dataloader, total=len(test_dataloader)):

    # Send data to device
    F = F.to(device)
    Y = Y.to(device)

    # Get predictions
    logits = model(F)

    # Compute the loss
    eval_loss = loss_function(logits, Y).item()

    # Compute the accuracy
    _, predictions = torch.max(torch.round(torch.sigmoid(logits)), 1)    
    batch_accuracy = predictions.eq(Y).sum().float().item() / Y.shape[0]
    eval_accuracy += batch_accuracy

    del F, Y

  eval_loss /= len(test_dataloader)
  eval_accuracy /= len(test_dataloader)

  return eval_loss, eval_accuracy
  


### **How to train our model for many epochs:**

In [6]:
import torch
from tqdm.notebook import tqdm

def train():
    global combined_vocabs, train_dataloader, val_dataloader

    device = torch.device("cuda")

    model = Seq2VecNN(len(combined_vocabs), 2, num_neurons_per_layer=[100, 25])
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters())

    patience = 3 #float("inf")
    num_epochs = float("inf")
    
    best_eval_loss = float("inf")

    num_poor = 0
    epoch = 1
    
    while epoch <= num_epochs and num_poor < patience:
        
      loss_function = torch.nn.CrossEntropyLoss()

      train_loss, train_accuracy = train_for_one_epoch(model, loss_function, optimizer, train_dataloader, device)

      eval_loss, eval_accuracy = evaluate_model(model, loss_function, val_dataloader, device)

      print(f'Epoch={epoch} Train-Loss={train_loss} Train-Acc={train_accuracy} Test-Loss={eval_loss} Test-Acc={eval_accuracy} Num-Poor={num_poor}')

      if eval_loss >= best_eval_loss:
        num_poor += 1

      else:
        num_poor = 0
        best_eval_loss = eval_loss

      epoch += 1

    return model

trained_model = train()


    


HBox(children=(IntProgress(value=0, max=47896), HTML(value='')))




HBox(children=(IntProgress(value=0, max=15966), HTML(value='')))


Epoch=1 Train-Loss=0.012140433589706078 Train-Acc=0.9931159439201603 Test-Loss=3.2105721224374257e-10 Test-Acc=0.9935390047601153


HBox(children=(IntProgress(value=0, max=47896), HTML(value='')))




HBox(children=(IntProgress(value=0, max=15966), HTML(value='')))


Epoch=2 Train-Loss=0.010331087097482488 Train-Acc=0.9935556988057458 Test-Loss=2.2399340389098317e-11 Test-Acc=0.9935624921708631


HBox(children=(IntProgress(value=0, max=47896), HTML(value='')))




HBox(children=(IntProgress(value=0, max=15966), HTML(value='')))


Epoch=3 Train-Loss=0.010113526382397207 Train-Acc=0.9935981084015366 Test-Loss=3.5465622282739007e-11 Test-Acc=0.9935272610547413


HBox(children=(IntProgress(value=0, max=47896), HTML(value='')))




HBox(children=(IntProgress(value=0, max=15966), HTML(value='')))


Epoch=4 Train-Loss=0.010038480164660819 Train-Acc=0.992991977409387 Test-Loss=0.0 Test-Acc=0.9935938087185269


HBox(children=(IntProgress(value=0, max=47896), HTML(value='')))




HBox(children=(IntProgress(value=0, max=15966), HTML(value='')))


Epoch=5 Train-Loss=0.01008124956153989 Train-Acc=0.9886029104726908 Test-Loss=7.466446796366107e-12 Test-Acc=0.9935977232869848


HBox(children=(IntProgress(value=0, max=47896), HTML(value='')))




HBox(children=(IntProgress(value=0, max=15966), HTML(value='')))


Epoch=6 Train-Loss=0.009989051508764179 Train-Acc=0.9936607441122431 Test-Loss=0.0 Test-Acc=0.9936133815608167


HBox(children=(IntProgress(value=0, max=47896), HTML(value='')))




HBox(children=(IntProgress(value=0, max=15966), HTML(value='')))


Epoch=7 Train-Loss=0.009846235851801407 Train-Acc=0.9936879297505707 Test-Loss=7.466446796366106e-11 Test-Acc=0.9936075097081297


## **Testing**
We are going to test our model on the test set

In [9]:
def test():
  ''' Used to test the model on the testing data'''
  global test_dataloader
  global trained_model

  device = torch.device("cuda")  

  # Evaluate the model
  loss_function = torch.nn.CrossEntropyLoss()
  test_loss, test_accuracy = evaluate_model(trained_model, loss_function, test_dataloader, device)

  print(f"Test loss={test_loss}, Test Accuracy={test_accuracy}")

test()

HBox(children=(IntProgress(value=0, max=16001), HTML(value='')))


Test loss=2.8081064123368606e-06, Test Accuracy=0.9941644272232986
