# Machine Translation Project (PyTorch Framework)

## Introduction
This notebook implements an end-to-end machine translation pipeline with two deep learning models built in **PyTorch**. The goal is to translate English to French.

- **Preprocess Pipeline** - Convert text to sequences of integers.
- **Model 1** - Bidirectional RNN with GRU/LSTM cells; the network includes a word embedding, an encoder, and a decoder.
- **Model 2** - Attention model with bidirectional RNN cells.
- **Prediction** - Run the model on English text.

In [None]:
%load_ext autoreload
%aimport helper, project_tests
%autoreload 1

In [None]:
import collections

import helper
import numpy as np
import project_tests as tests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

import pandas as pd
import matplotlib.pyplot as plt
import logger
import time
import os
import copy

### Verify access to the GPU

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

## Dataset
We begin by investigating the dataset that will be used to train and evaluate your pipeline.  The most common datasets used for machine translation are from [WMT](http://www.statmt.org/).  
### Load Data
The data is located in `data/small_vocab_en` and `data/small_vocab_fr`. The `small_vocab_en` file contains English sentences, and the `small_vocab_fr` file contains their French translations. Load the English and French data from these files by running the cell below.

In [None]:
# Load English data (forward slashes work on every platform and avoid the invalid '\s' escape)
english_sentences = helper.load_data('data/small_vocab_en')
# Load French data
french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')

### Files
Each line in `small_vocab_en` contains an English sentence with the respective translation in each line of `small_vocab_fr`.  View the first two lines from each file.

In [None]:
for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

In [None]:
len(english_sentences), len(french_sentences)

## Preprocess
Text preprocessing has three steps:

1. **Vocabulary creation**
2. **Tokenization** - implemented with Keras
3. **Padding to a common length** - implemented with Keras

The complexity of the problem is determined by the complexity of the vocabulary.  A more complex vocabulary is a more complex problem.  Let's look at the complexity of the dataset we'll be working with.
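
The three steps listed above can be sketched in plain Python to show what the Keras utilities produce. This is an illustrative approximation, not the Keras implementation: the helper names `build_vocab`, `encode`, and `pad_post` are hypothetical, and the sketch does not reproduce Keras's punctuation filtering or truncation options. Ids start at 1 and are assigned by descending word frequency, with 0 reserved for padding.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each lowercased word to an id, most frequent word first (ids start at 1)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def encode(sentences, vocab):
    """Replace every word with its id."""
    return [[vocab[w] for w in s.lower().split()] for s in sentences]

def pad_post(sequences, length=None, pad_id=0):
    """Append pad_id to each sequence until all share the same length."""
    if length is None:
        length = max(len(s) for s in sequences)
    return [s[:length] + [pad_id] * (length - len(s)) for s in sequences]

sentences = ["the cat sat", "the dog sat down"]
vocab = build_vocab(sentences)
padded = pad_post(encode(sentences, vocab))
print(vocab)
print(padded)
```

Reserving id 0 for padding is why the embedding and output layers later in the notebook are sized with `vocab_size + 1`.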

In [None]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

In [None]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # Fit a Keras tokenizer on x, then map each sentence to a list of word ids
    text_tokenizer = Tokenizer()
    text_tokenizer.fit_on_texts(x)
    text_tokenized = text_tokenizer.texts_to_sequences(x)
    
    return text_tokenized, text_tokenizer

tests.test_tokenize(tokenize)

# Tokenize Example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

In [None]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # Default to the length of the longest sequence in x
    if length is None:
        length = max(len(seq) for seq in x)

    # Append zeros after each sequence ('post' padding) up to the target length
    return pad_sequences(x, maxlen=length, padding='post')

tests.test_pad(pad)

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

In [None]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Add a trailing axis to the labels (a carry-over from the Keras version of this project);
    # it is flattened again before the PyTorch datasets are built
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

## DataLoaders

Prepare the data in PyTorch's `Dataset`/`DataLoader` format:
- Split the dataset into a training set and a held-out test set (used for validation).
- Wrap each split in a custom `Dataset` and feed it to a `DataLoader`.
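
The two steps above can be sketched without any framework: shuffle the sample indices, cut them into two disjoint groups, and walk each group in fixed-size batches. The function names `split_indices` and `batches` are hypothetical and exist only for this illustration; the notebook itself uses NumPy indexing and `torch.utils.data.DataLoader` for the same purpose.

```python
import random

def split_indices(n_samples, test_ratio=0.2, seed=0):
    """Return shuffled (train, test) index lists that together cover every sample."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_ratio)
    split_point = n_samples - n_test
    # Slicing with [split_point:] keeps the final index; [split_point:-1] would drop it
    return indices[:split_point], indices[split_point:]

def batches(indices, batch_size):
    """Yield consecutive index batches; the last one may be smaller."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]

train_idx, test_idx = split_indices(10, test_ratio=0.2)
print(len(train_idx), len(test_idx))
print([len(b) for b in batches(train_idx, batch_size=3)])
```

A quick check that the two index lists are disjoint and cover all samples is a cheap way to catch off-by-one errors in a hand-written split.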

In [None]:
batch_size = 64

In [None]:
def train_valid_split(x_data, y_data, split_ratio=0.2):
    """Shuffle the samples and hold out `split_ratio` of them for testing."""
    assert x_data.shape[0] == y_data.shape[0]
    data_length = x_data.shape[0]
    index = np.random.permutation(data_length)
    split_point = int(data_length * (1 - split_ratio))
    # Slice to the end of the permutation; a stop index of -1 would drop the last sample
    train_index, test_index = index[:split_point], index[split_point:]
    train_data = (x_data[train_index], y_data[train_index])
    test_data = (x_data[test_index], y_data[test_index])

    return train_data, test_data

In [None]:
class TimeSeriesDataset(Dataset):
    ### Sequence Dataset
    
    def __init__(self, sequences_in, sequences_out):
        super().__init__()
        self.len = sequences_in.shape[0]
        self.x_data = torch.from_numpy(sequences_in).long()
        self.y_data = torch.from_numpy(sequences_out).long()
    
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
    
        
    def __len__(self):
        return self.len
        

In [None]:
train_data, test_data = train_valid_split(preproc_english_sentences, preproc_french_sentences, split_ratio=0.2)

In [None]:
test = train_data[1].reshape(-1, max_french_sequence_length)  # flatten the trailing label axis
test.shape

In [None]:
train_dataset = TimeSeriesDataset(train_data[0], train_data[1].reshape(-1, max_french_sequence_length))
test_dataset = TimeSeriesDataset(test_data[0], test_data[1].reshape(-1, max_french_sequence_length))
print(len(train_dataset))
print(len(test_dataset))

In [None]:
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

In [None]:
dataset_size = {'train':len(train_dataset), 'test':len(test_dataset)}
dataset_size

In [None]:
dataloaders = dict()
dataloaders['train'] = train_loader
dataloaders['test'] = test_loader

## Models

- **Model 1** - Bidirectional RNN with GRU/LSTM cells; the network includes a word embedding, an encoder, and a decoder.
- **Model 2** - Attention model with bidirectional RNN cells.

Both models are defined below; Model 2 is the one instantiated and trained later in this notebook.
### Ids Back to Text
The neural network outputs word ids, which isn't the final form we want. We want the French translation. A function such as `logits_to_text` bridges the gap between the scores produced by the neural network and the French translation. You'll be using this kind of mapping to better understand the output of the neural network.
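
The notebook does not define `logits_to_text`, so the following is only a minimal sketch of what such a function could look like. It assumes the scores arrive as a list of per-time-step score lists (shape `(time, vocab)`) and that `id_to_word` maps id 0 to `'<PAD>'`; both the input layout and the toy vocabulary are assumptions made for illustration.

```python
def logits_to_text(logits, id_to_word):
    """Pick the highest-scoring id at each time step and join the words.

    logits: list of per-time-step score lists, shape (time, vocab).
    id_to_word: dict mapping word id -> word (0 is the padding id here).
    """
    ids = [max(range(len(step)), key=step.__getitem__) for step in logits]
    return ' '.join(id_to_word[i] for i in ids)

# Toy vocabulary and scores, invented for this example
id_to_word = {0: '<PAD>', 1: 'le', 2: 'chat'}
scores = [[0.1, 0.7, 0.2], [0.0, 0.3, 0.9], [0.8, 0.1, 0.1]]
print(logits_to_text(scores, id_to_word))
```

Because only the position of the largest score matters, the same function works on raw logits and on softmax probabilities.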


### Model 1


In [None]:
## Define a helper module (custom TimeDistributed wrapper that applies a module at every time step)
class TimeDistributed(nn.Module):
    
    def __init__(self, module, batch_first=False):
        super().__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):

        if len(x.size()) <= 2:
            return self.module(x)

        # Squash samples and timesteps into a single axis
        x_reshape = x.contiguous().view(-1, x.size(-1))  # (samples * timesteps, input_size)

        y = self.module(x_reshape)

        # We have to reshape Y
        if self.batch_first:
            y = y.contiguous().reshape(x.size(0), -1, y.size(-1))  # (samples, timesteps, output_size)
        else:
            y = y.view(-1, x.size(1), y.size(-1))  # (timesteps, samples, output_size)

        return y

In [None]:
# Define the feed-through model (the encoder's final output is repeated as decoder input)
class Model1(nn.Module):
    
    def __init__(self, english_vocab_size, french_vocab_size, max_french_sequence_length, 
                 embedding_dim=200, hidden_dim=100, rnn_module=nn.LSTM):
        
        super().__init__()
        self.max_french_sequence_length = max_french_sequence_length
        # Index 0 is reserved for padding, hence the +1
        self.emb_vector = nn.Embedding(english_vocab_size+1, embedding_dim)
        self.enc_rnn_1 = rnn_module(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=1, 
                                    batch_first=True, bidirectional=True)
        self.dec_rnn_1 = rnn_module(input_size=2*hidden_dim, hidden_size=hidden_dim, num_layers=1, 
                                    batch_first=True, bidirectional=True)
        # Register BatchNorm here so its parameters are trained and moved to the device with the model
        self.bn = nn.BatchNorm1d(2*hidden_dim)
        self.fc = nn.Linear(2*hidden_dim, french_vocab_size+1)
        
    def forward(self, inputs):
        
        embd_inputs = self.emb_vector(inputs)
        en_out, en_hn = self.enc_rnn_1(embd_inputs)
        # Encoder output at the last time step, shape=(batch, 2*hidden_dim)
        en_out_end = en_out[:, -1]
        # Repeat it once per output time step, shape=(batch, time-series-length, 2*hidden_dim)
        decode_inputs = en_out_end.unsqueeze(1).repeat(1, self.max_french_sequence_length, 1)
        de_out, dn_hn = self.dec_rnn_1(decode_inputs)
        # BatchNorm1d expects (batch, channels, time); permute swaps the axes,
        # whereas view would only reinterpret memory and scramble the values
        b_de_out = self.bn(de_out.permute(0, 2, 1)).permute(0, 2, 1)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself
        logits = self.fc(b_de_out) #shape=(batch, time-series-length, categories)
        
        return logits.permute(0, 2, 1) #shape=(batch, categories, time-series-length)


### Model 2


In [None]:
## Define Scoring Function
def score_multiply(hx, enc_h):
    # hx: decoder hidden state, shape=(batch, 2*hidden_dim)
    # enc_h: encoder outputs, shape=(batch, seq_length, 2*hidden_dim)
    # Dot-product score per encoder step, normalized over the sequence axis
    score = F.softmax(torch.matmul(enc_h, hx.unsqueeze(2)), dim=1)  # shape=(batch, seq_length, 1)
    # Attention vector: score-weighted sum of the encoder outputs
    atten_vec = torch.sum(enc_h * score, dim=1)  # shape=(batch, 2*hidden_dim)
    
    return atten_vec

## Define Attention_Decoder_Model
class Attention_Decode(nn.Module):
    
    def __init__(self, input_dim, hidden_dim, rnn_module=nn.LSTMCell):
        
        super().__init__()
        self.input_dim = input_dim
        self.dec_rnn = rnn_module(input_dim, hidden_dim)
        # Projects the glued [atten_vec, hx] vector back to the input size.
        # Created here (not in forward) so that it is registered and trained.
        self.wc = nn.Linear(2*hidden_dim, input_dim)
    
    def forward(self, ini_x, ini_hc, enc_h):
        
        # ini_x is the tensor for the initial input 
        # ini_hc is the initial hidden state; for an LSTMCell it is the (h, c) pair
        # enc_h is the encoder output tensor that attention is computed over
        # hx, cx = LSTMcell(ini_x, ini_hc)
        # atten_vec = score(hx, enc_h) 
        # Glue atten_vec to hx => cat(atten_vec, hx)
        # x_next = tanh(wc[glued_vector])
        # hc_next = (hx, cx)
        # forward (x_next, hc_next) to the next step (repeated once per output time step)
        
        hx, cx = self.dec_rnn(ini_x, (ini_hc[0], ini_hc[1]))
        atten_vec = score_multiply(hx, enc_h)
        glued_vector = torch.cat((atten_vec, hx), dim=1)
        x_next = torch.tanh(self.wc(glued_vector))
        hc_next = (hx, cx)
  
        return x_next, hc_next

In [None]:
# Define the attention model
class Model2(nn.Module):
    
    def __init__(self, english_vocab_size, french_vocab_size, max_french_sequence_length, 
                 embedding_dim=200, hidden_dim=100, rnn_encode_module=nn.LSTM, rnn_decode_cell=nn.LSTMCell):
        
        super().__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.emb_vector = nn.Embedding(english_vocab_size+1, embedding_dim)
        self.enc_rnn_1 = rnn_encode_module(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=1, 
                                    batch_first=True, bidirectional=True)
        # One decode step per output position, built here (not in forward) so that
        # the parameters are registered with the model and seen by the optimizer
        self.decode_process = nn.ModuleList([
            Attention_Decode(input_dim=embedding_dim, hidden_dim=hidden_dim*2, rnn_module=rnn_decode_cell)
            for _ in range(max_french_sequence_length)])
        self.bn = nn.BatchNorm1d(embedding_dim)
        self.fc = nn.Linear(embedding_dim, french_vocab_size+1)
        
    def forward(self, inputs):
        
        embd_inputs = self.emb_vector(inputs)
        en_out, en_hn = self.enc_rnn_1(embd_inputs)
        enc_h = en_out # Treat the output as the hidden state
        # Initial hidden and cell states: zero tensors on the same device as the inputs
        batch = inputs.shape[0]
        h_0 = torch.zeros((batch, self.hidden_dim*2), device=inputs.device)
        c_0 = torch.zeros((batch, self.hidden_dim*2), device=inputs.device)
        # Initial input: the embedding of index 0 (the padding id), shape=(batch, embedding_dim)
        i_0 = self.emb_vector(torch.zeros(batch, dtype=torch.long, device=inputs.device))
        x_next = i_0
        hc_next = (h_0, c_0)
        x_sequence = list()
        
        for decode in self.decode_process:
            x_next, hc_next = decode(x_next, hc_next, enc_h)
            x_sequence.append(x_next)
        
        # Stack the per-step outputs, shape=(batch, time-series-length, embedding_dim)
        outputs_c = torch.stack(x_sequence, dim=1)
        # BatchNorm1d expects (batch, channels, time); permute swaps the axes,
        # whereas view would only reinterpret memory and scramble the values
        outputs_c_bn = self.bn(outputs_c.permute(0, 2, 1)).permute(0, 2, 1)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself
        logits = self.fc(outputs_c_bn) #shape=(batch, time-series-length, categories)
        
        return logits.permute(0, 2, 1) #shape=(batch, categories, time-series-length)
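
The scoring step used by Model 2 is a dot-product attention: each encoder state is scored against the current decoder state, the scores are normalized with a softmax over the sequence, and the encoder states are averaged with those weights. The plain-Python sketch below works through that arithmetic for a single example. The function name `dot_product_attention` and the toy vectors are hypothetical; the notebook's `score_multiply` does the same computation on batched tensors.

```python
import math

def dot_product_attention(query, keys):
    """Return (weights, context) for one decoder state against a list of encoder states."""
    # One dot-product score per encoder state
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax over the sequence axis (shifted by the max for numerical stability)
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder states
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(query))]
    return weights, context

decoder_state = [1.0, 0.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = dot_product_attention(decoder_state, encoder_states)
print(weights)
print(context)
```

The encoder state that points in the same direction as the decoder state receives the larger weight, so the context vector leans toward it.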



## Loss/Accuracy and Optimization/LRrate Function Setting


In [None]:
model = Model2(english_vocab_size, french_vocab_size, max_french_sequence_length, 
                 embedding_dim=200, hidden_dim=100, rnn_encode_module=nn.LSTM, rnn_decode_cell=nn.LSTMCell)

In [None]:
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer_ft = optim.Adam(model.parameters(), lr=5e-3)
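
`nn.CrossEntropyLoss` applies a log-softmax to its input internally, so it expects raw scores (logits) and not probabilities. The plain-Python sketch below shows the per-token quantity it computes. The function name `cross_entropy_from_logits` and the example scores are hypothetical and serve only as illustration.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# A confident, correct prediction gives a small loss
print(cross_entropy_from_logits([2.0, 1.0, 0.1], target=0))
# Uniform scores over 3 classes give log(3)
print(cross_entropy_from_logits([0.0, 0.0, 0.0], target=1))
```

Passing values that have already been through a softmax would squash the differences between classes a second time and weaken the gradient signal.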

In [None]:
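def lr_decay(epoch):
    # LambdaLR multiplies the optimizer's initial lr (5e-3) by the returned factor,
    # so return ratios that give 5e-3, 2e-3, and 1e-3 rather than absolute rates
    lr_factor = np.ones(18)
    lr_factor[0:12] = 1.0
    lr_factor[12:16] = 0.4
    lr_factor[16:] = 0.2
    
    # Clamp so that an epoch index past the end of the table reuses the last factor
    return lr_factor[min(epoch, len(lr_factor) - 1)]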


In [None]:
exp_lr_scheduler = lr_scheduler.LambdaLR(optimizer=optimizer_ft, lr_lambda=lr_decay, last_epoch=-1)

## Training & Validation Loop


Check the input and output tensor shapes.

In [None]:
a = list(dataloaders['train'])
b = list(dataloaders['test'])
print(len(a), len(b))

In [None]:
input, label = a[0]

In [None]:
print(input[0].view(1,-1).shape)
print(label[0].view(1,-1).shape)

In [None]:
output = model(input.to(device))  # the model lives on `device`, so move the batch there too

In [None]:
output.shape

In [None]:
loss = criterion(output, label.to(device))
loss

### Define the training procedure

In [None]:
def mac_train_model(model, criterion, optimizer, scheduler, num_epochs=18):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 24)

        # Each epoch has a training phase and an evaluation phase on the held-out test split
        for phase in ['train', 'test']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluation mode; BatchNorm uses its running statistics

            running_loss = 0.0
            running_corrects = 0
            running_tokens = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track gradient history only in the training phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    # outputs: (batch, categories, time); the argmax over categories gives the predicted ids
                    max_tensor, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.shape[0]
                running_corrects += torch.sum(preds == labels.data)
                running_tokens += labels.numel()

            if phase == 'train':
                scheduler.step()  # Update the learning rate once per epoch, after the optimizer steps

            epoch_loss = running_loss / dataset_size[phase]
            # Accuracy is per token (padding included), so divide by the token count, not the sentence count
            epoch_acc = running_corrects.double() / running_tokens

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model when the held-out accuracy improves
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best test Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model
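
The accuracy reported by the loop compares predicted ids with target ids position by position. The sketch below shows that computation in plain Python, with an optional switch to skip padded positions. The function name `token_accuracy`, the `pad_id` option, and the toy sequences are hypothetical additions for illustration; the training loop itself counts every position, padding included.

```python
def token_accuracy(pred_ids, target_ids, pad_id=None):
    """Fraction of positions where the prediction equals the target.

    pred_ids, target_ids: lists of equal-length id sequences.
    pad_id: if given, positions whose target is pad_id are skipped.
    """
    correct = total = 0
    for pred_seq, target_seq in zip(pred_ids, target_ids):
        for p, t in zip(pred_seq, target_seq):
            if pad_id is not None and t == pad_id:
                continue
            total += 1
            correct += int(p == t)
    return correct / total if total else 0.0

preds   = [[4, 7, 0, 0], [5, 2, 9, 0]]
targets = [[4, 8, 0, 0], [5, 2, 9, 0]]
print(token_accuracy(preds, targets))            # counts padding positions
print(token_accuracy(preds, targets, pad_id=0))  # skips padding positions
```

Counting padded positions tends to inflate the score, because predicting the padding id at the end of a short sentence is easy.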


## Train


In [None]:
model_1 = mac_train_model(model=model, criterion=criterion, optimizer=optimizer_ft, 
                          scheduler=exp_lr_scheduler, num_epochs=18)

## Validation

In [None]:
x_tk = english_tokenizer
y_tk = french_tokenizer
x_id_to_word = {value: key for key, value in x_tk.word_index.items()}
y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
x_id_to_word[0] = '<PAD>'
y_id_to_word[0] = '<PAD>'

In [None]:
len(x_id_to_word)

In [None]:
index = 0
pred_input = input[index].view(1, -1)

# Run the trained model in evaluation mode without tracking gradients
model_1.eval()
with torch.no_grad():
    output = model_1(pred_input.to(device))  # shape=(1, categories, time-series-length)

# Drop the batch axis; the category axis has french_vocab_size + 1 entries (index 0 is padding)
pred_output = output[0].cpu().numpy()
pred_ids = np.argmax(pred_output, axis=0)

pred_input = pred_input.numpy().reshape(-1)
target_output = label[index].numpy()

# The id-to-word dictionaries come from the cell above
print(' '.join([x_id_to_word[x] for x in pred_input]))
print(' '.join([y_id_to_word[x] for x in pred_ids]))
print('correct French sentence')
print(' '.join([y_id_to_word[x] for x in target_output]))


### Generate the html

**Save your notebook before running the next cell to generate the HTML output.** Then submit your project.

In [None]:
# Save before you run this cell!
!!jupyter nbconvert *.ipynb

## Optional Enhancements

This project focuses on learning various network architectures for machine translation, but we don't evaluate the models according to best practices by splitting the data into separate test & training sets -- so the model accuracy is overstated. Use the [`sklearn.model_selection.train_test_split()`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to create separate training & test datasets, then retrain each of the models using only the training set and evaluate the prediction accuracy using the hold out test set. Does the "best" model change?