# Machine Translation Project (PyTorch Framework)

## Introduction
In this notebook, the machine translation end2end pipeline is implemented; two DL models are implemented with **PyTorch** Frameworks; the goal is to translate from English to French

- **Preprocess Pipeline** - Convert text to sequence of integers.
- **Model 1** Bi-Directional RNN with (GRU/LSTM) cells; the neural network includes word embedding, encoder/decoder
- **Model 2** Implement Attention Model with Bi-Directional RNN cells 
- **Prediction** Run the model on English text.

In [1]:
%load_ext autoreload
%aimport helper, tests
%autoreload 1

In [2]:
import collections

import helper
import numpy as np
import project_tests as tests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import logger
import time
import os
import copy

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


### Verify access to the GPU

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

## Dataset
We begin by investigating the dataset that will be used to train and evaluate your pipeline.  The most common datasets used for machine translation are from [WMT](http://www.statmt.org/).  
### Load Data
The data is located in `data/small_vocab_en` and `data/small_vocab_fr`. The `small_vocab_en` file contains English sentences with their French translations in the `small_vocab_fr` file. Load the English and French data from these files from running the cell below.

In [4]:
# Load English data
english_sentences = helper.load_data('data\small_vocab_en')
# Load French data
french_sentences = helper.load_data('data\small_vocab_fr')

print('Dataset Loaded')

Dataset Loaded


### Files
Each line in `small_vocab_en` contains an English sentence with the respective translation in each line of `small_vocab_fr`.  View the first two lines from each file.

In [5]:
for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))

small_vocab_en Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2:  the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2:  les Ã©tats-unis est gÃ©nÃ©ralement froid en juillet , et il gÃ¨le habituellement en novembre .


In [6]:
len(english_sentences), len(french_sentences)

(137861, 137861)

## Preprocess
Three steps in the text preprocess

- 1. **Vocabulary Creation**
- 2. **Tokenize** Implemented with Keras
- 3. **Padding to the same length** Implemented with Keras

The complexity of the problem is determined by the complexity of the vocabulary.  A more complex vocabulary is a more complex problem.  Let's look at the complexity of the dataset we'll be working with.

In [7]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"


In [8]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # TODO: Implement
    text_tokenizer = Tokenizer()
    text_tokenizer.fit_on_texts(x)
    text_tokenized = text_tokenizer.texts_to_sequences(x)
    
    return text_tokenized, text_tokenizer

tests.test_tokenize(tokenize)

# Tokenize Example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


In [9]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # TODO: Implement
    max_length = 0
    
    if length!=None:
        max_length = length
    else:
        for i in x:
            if len(i) > max_length:
                max_length = len(i)
                
    return pad_sequences(x, maxlen=max_length, padding='post')

tests.test_pad(pad)

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]


In [10]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 345


## DataLoaders

To write the dataloaders format in Pytorch
- Split Dataset to Train and Validation and Test
- Applied into customized dataloaders

In [11]:
batch_size = 1024

In [12]:
def train_valid_split(x_data, y_data, split_ratio=0.2):
    
    assert(x_data.shape[0] == y_data.shape[0])
    data_length = x_data.shape[0]
    index = np.random.permutation(data_length).tolist()
    train_data_x = x_data[index[0:int(data_length*(1-split_ratio))], :]
    train_data_y = y_data[index[0:int(data_length*(1-split_ratio))], :]
    test_data_x = x_data[index[int(data_length*(1-split_ratio)):-1], :]
    test_data_y = y_data[index[int(data_length*(1-split_ratio)):-1], :]
    train_data = (train_data_x, train_data_y)
    test_data = (test_data_x, test_data_y)
    
    return train_data, test_data

In [13]:
class TimeSeriesDataset(Dataset):
    ### Sequence Dataset
    
    def __init__(self, sequences_in, sequences_out):
        super().__init__()
        self.len = sequences_in.shape[0]
        self.x_data = torch.from_numpy(sequences_in).long()
        self.y_data = torch.from_numpy(sequences_out).long()
    
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
    
        
    def __len__(self):
        return self.len
        

In [14]:
train_data, test_data = train_valid_split(preproc_english_sentences, preproc_french_sentences, split_ratio=0.2)

In [15]:
test = train_data[1].reshape(-1, 21)
test.shape

(110288, 21)

In [16]:
train_dataset = TimeSeriesDataset(train_data[0], train_data[1].reshape(-1, max_french_sequence_length))
test_dataset = TimeSeriesDataset(test_data[0], test_data[1].reshape(-1, max_french_sequence_length))
print(len(train_dataset))
print(len(test_dataset))

110288
27572


In [17]:
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

In [18]:
dataset_size = {'train':len(train_dataset), 'test':len(test_dataset)}
dataset_size

{'test': 27572, 'train': 110288}

In [19]:
dataloaders = dict()
dataloaders['train'] = train_loader
dataloaders['test'] = test_loader

## Models

- **Model 1** Bi-Directional RNN with (GRU/LSTM) cells; the neural network includes word embedding, encoder/decoder
- **Model 2** Implement Attention Model with Bi-Directional RNN cells 

After experimenting with the four simple architectures, you will construct a deeper architecture that is designed to outperform all four models.
### Ids Back to Text
The neural network will be translating the input to words ids, which isn't the final form we want.  We want the French translation.  The function `logits_to_text` will bridge the gab between the logits from the neural network to the French translation.  You'll be using this function to better understand the output of the neural network.


### Model 1


In [20]:
## Define help Model (Customized TimeDistributed Model)
class TimeDistributed(nn.Module):
    
    def __init__(self, module, batch_first=False):
        super().__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):

        if len(x.size()) <= 2:
            return self.module(x)

        # Squash samples and timesteps into a single axis
        x_reshape = x.contiguous().view(-1, x.size(-1))  # (samples * timesteps, input_size)

        y = self.module(x_reshape)

        # We have to reshape Y
        if self.batch_first:
            y = y.contiguous().reshape(x.size(0), -1, y.size(-1))  # (samples, timesteps, output_size)
        else:
            y = y.view(-1, x.size(1), y.size(-1))  # (timesteps, samples, output_size)

        return y

In [25]:
# Define Feedthrough Model
class Model1(nn.Module):
    
    def __init__(self, english_vocab_size, french_vocab_size, max_french_sequence_length, 
                 embedding_dim=200, hidden_dim=100, rnn_module = nn.LSTM):
        
        super().__init__()
        self.emb_vector = nn.Embedding(english_vocab_size+1
                                       , embedding_dim)
        self.enc_rnn_1 = rnn_module(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=1, 
                                    batch_first=True, bidirectional=True)
        self.dec_rnn_1 = rnn_module(input_size=2*hidden_dim, hidden_size=hidden_dim, num_layers=1, 
                                    batch_first=True, bidirectional=True)
        #self.bn = nn.BatchNorm1d(2*hidden_dim)
        self.fc = nn.Linear(2*hidden_dim, french_vocab_size+1)
        #self.time_series = TimeDistributed(self.fc, batch_first=True)
        
    def forward(self, inputs):
        
        embd_inputs = self.emb_vector(inputs)
        en_out, en_hn = self.enc_rnn_1(embd_inputs)
        en_out_end = en_out[:, -1]
        decode_inputs = en_out_end.view(en_out_end.size()[0], 1, -1)
        decode_inputs = decode_inputs.repeat(1, max_french_sequence_length, 1)
        de_out, dn_hn = self.dec_rnn_1(decode_inputs)
        # Add Batch Norm 
        bn_1 = nn.BatchNorm1d(de_out.shape[2]).to(device)
        b_de_out = bn_1(de_out.contiguous().view(de_out.shape[0], de_out.shape[2], de_out.shape[1])) 
        #shape=(batch, catogories, time-series-length)
        # softmax along the catogories axis
        outputs = self.fc(b_de_out.view(de_out.shape[0], de_out.shape[1], de_out.shape[2]))
        #outputs = F.softmax(self.fc(b_de_out.view(de_out.shape[0], de_out.shape[1], de_out.shape[2])), 
                            #dim=2) #shape=(batch, time-series-length, catogories)
        #outputs = F.softmax(self.time_series(b_de_out.contiguous().
        #                                    reshape(de_out.shape[0], de_out.shape[1], de_out.shape[2])), dim=2)
        #outputs = F.softmax(TimeDistributed(self.fc, batch_first=True)(de_out), dim=2)
        #outputs = torch.exp(outputs)
        
        return outputs.view(-1, outputs.shape[2], outputs.shape[1]) #shape=(batch, catogories, time-series-length)


### Model 2


In [22]:
## Define help Model (Customized TimeDistributed Model)
class TimeDistributed(nn.Module):
    
    def __init__(self, module, batch_first=False):
        super().__init__()
        self.module = module
        self.batch_first = batch_first

    def forward(self, x):

        if len(x.size()) <= 2:
            return self.module(x)

        # Squash samples and timesteps into a single axis
        x_reshape = x.contiguous().view(-1, x.size(-1))  # (samples * timesteps, input_size)

        y = self.module(x_reshape)

        # We have to reshape Y
        if self.batch_first:
            y = y.contiguous().view(x.size(0), -1, y.size(-1))  # (samples, timesteps, output_size)
        else:
            y = y.view(-1, x.size(1), y.size(-1))  # (timesteps, samples, output_size)

        return y

In [23]:
## Define Scoring Function
def score_multiply(hx, enc_h):
    score = F.softmax(torch.matmul(enc_h, 
                                   hx.view(hx.shape[0], hx.shape[1], 1)),
                      dim=1)
    batch = score.shape[0]
    seq_length = score.shape[1]
    enc_h_new = torch.mul(enc_h, score)
    atten_vec = torch.sum(enc_h_new, dim=1)
    
    return atten_vec

## Define Attention_Decoder_Model
class Attention_Decode(nn.Module):
    
    def __init__(self, enc_h, input_dim, hidden_dim, rnn_module=nn.LSTMCell):
        
        # enc_h is encoded hidden tensor 
        super().__init__()
        self.input_dim = input_dim
        self.enc_h = enc_h
        self.dec_rnn = rnn_module(input_dim, hidden_dim)
    
    def forward(self, ini_x, ini_hc):
        
        # ini_x is the tensor for the initial input 
        # ini_hc is the tensor for the initial hidden layer; LSTMcell will be h and c states
        # hx, cx = LSTMcell(ini_x, ini_hc)
        # atten_vec = score(hx, enc_h) 
        # Glued atten_vec to hx => cat(atten_vec, hx)
        # x_next = tanh(wc[Glued_vector])
        # hc_next = (hx, cx)
        # foward (x_next, h_next) to create the next layer (repeat length times)
        
        hx, cx = self.dec_rnn(ini_x, (ini_hc[0], ini_hc[1]))
        atten_vec = score_multiply(hx, self.enc_h)
        glued_vector = torch.cat((atten_vec, hx), dim=1)
        x_next = F.tanh(nn.Linear(glued_vector.shape[1], self.input_dim)(glued_vector))
        hc_next = (hx, cx)
  
        return x_next, hc_next

In [24]:
# Define Feedthrough Model
class Model2(nn.Module):
    
    def __init__(self, english_vocab_size, french_vocab_size, max_french_sequence_length, 
                 embedding_dim=200, hidden_dim=100, rnn_encode_module=nn.LSTM, rnn_decode_cell=nn.LSTMCell):
        
        super().__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.emb_vector = nn.Embedding(english_vocab_size+1, embedding_dim)
        self.enc_rnn_1 = rnn_encode_module(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=1, 
                                    batch_first=True, bidirectional=True)
        self.rnn_decode_cell = rnn_decode_cell
        self.fc = nn.Linear(embedding_dim, french_vocab_size+1)
        #self.time_series = TimeDistributed(self.fc, batch_first=True)
        
    def forward(self, inputs):
        
        embd_inputs = self.emb_vector(inputs)
        en_out, en_hn = self.enc_rnn_1(embd_inputs)
        enc_h = en_out # Treat the output as the hidden state
        #Initiate Attention Decode Module        
        Decode_process = nn.ModuleList([Attention_Decode(enc_h, input_dim=self.embedding_dim, hidden_dim=self.hidden_dim*2, 
                                                         rnn_module=nn.LSTMCell).to(device) 
                                        for i in range(max_french_sequence_length)])
        # 1st x_next and h_next
        # x_next is <END>
        # hc_next is zero tensor with the correct dimension 
        h_0 = torch.zeros((inputs.shape[0], self.hidden_dim*2))
        c_0 = torch.zeros((inputs.shape[0], self.hidden_dim*2))
        # i_0 should implement with the <END> embed matrix (batch, embd_dim)
        i_0 = self.emb_vector(torch.zeros((inputs.shape[0])).type(torch.LongTensor))
        x_next = i_0
        hc_next = (h_0, c_0)
        x_sequence = list()
        hc_sequence = list() # h contains (h, c)
        
        for i, decode in enumerate(Decode_process):
            x_next, hc_next = decode(x_next, hc_next)
            x_sequence.append(x_next)
            hc_sequence.append(hc_next)
        
        # stack x_sequence and hc_sequence
        x_sequence = [i.view(i.shape[0], 1, -1) for i in x_sequence]
        outputs_c = torch.cat(x_sequence, dim=1)
        # BatchNorm
        bn_1 = nn.BatchNorm1d(outputs_c.shape[2]).to(device)
        outputs_c_bn = bn_1(outputs_c.contiguous().view(outputs_c.shape[0], outputs_c.shape[2], outputs_c.shape[1])) 
        # softmax along the final matrix
        #
        outputs = F.softmax(self.fc(outputs_c_bn.view
                                    (outputs_c.shape[0], outputs_c.shape[1], outputs_c.shape[2])), dim=2)    
        #outputs = F.softmax(self.time_series(outputs_c_bn.view
                                             #(outputs_c.shape[0], outputs_c.shape[1], outputs_c.shape[2])), dim=2)
        
        return outputs.view(-1, outputs.shape[2], outputs.shape[1])



## Loss/Accuracy and Optimization/LRrate Function Setting


In [27]:
model = Model1(english_vocab_size, french_vocab_size, max_french_sequence_length, 
                 embedding_dim=200, hidden_dim=100, rnn_module = nn.LSTM)

In [28]:
model = model.to(device)
#criterion = nn.CrossEntropyLoss()
criterion = nn.CrossEntropyLoss()
optimizer_adam = optim.Adam(model.parameters(), lr=5e-3)

### Learning Rate Scheduler

In [30]:
def lr_decay(epoch):
    lr_matrix = np.ones(1600)
    lr_matrix[0:400] = 8e-3
    lr_matrix[400:800] = 5e-3
    lr_matrix[800:1200] = 2.5e-3
    lr_matrix[1200:] = 1e-3
    
    return lr_matrix[epoch]

exp_lr_scheduler = lr_scheduler.LambdaLR(optimizer=optimizer_ft, lr_lambda=lr_decay, last_epoch=-1)

In [31]:
optimizer_sgd = optim.SGD(model.parameters(), lr=0.2)
exp_lr_scheduler_anneal = lr_scheduler.CosineAnnealingLR(optimizer=optimizer_ft, 
                                                         T_max=400, 
                                                         eta_min=2e-3, 
                                                         last_epoch=-1)

### Learning_Rate Finder (Under Development)

In [None]:
def mac_lr_finder(model, criterion, lr_logrange=[-5, 0]):
    
    # numbers of iterations
    n_iter = len(list(dataloaders['train'])) #108
    
    # lr rate matrix 
    lr_matrix = np.logspace(lr_logrange[0], lr_logrange[1], n_iter)
    loss_matrix = np.zeros(n_iter)
    
    for epoch in range(1):
        
        # Each epoch has a training and validation phase
        for phase in ['train']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode; won't alter to different dropout and BatchNorm weights

            idx = 0
            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                idx = idx+1
                inputs = inputs.to(device)
                labels = labels.to(device)
                
                # Set Optimizer
                optimizer = optim.Adam(model.parameters(), lr=lr_matrix[idx])
                
                # zero the parameter gradients
                optimizer.zero_grad()
                
                # forward
                # turn on the tracking history in train and turn off the tracking history in others
                with torch.set_grad_enabled(phase == 'train'): #set gradient calculation enabled
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    loss_matrix[idx] = loss.item()

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                        
    lr_matrix = lr_matrix.reshape((lr_matrix.shape[0], 1))
    loss_matrix = lr_matrix.reshape((loss_matrix.shape[0], 1))

    return np.concatenate((lr_matrix, loss_matrix), axis=1)


## Training & Validation Loop


Check Input and Output Tensor Shape

In [32]:
a = list(dataloaders['train'])
b = list(dataloaders['test'])
print(len(a), len(b))

108 27


In [33]:
input, label = a[0]

In [34]:
print(input[0].view(1,-1).shape)
print(label[0].view(1,-1).shape)

torch.Size([1, 15])
torch.Size([1, 21])


In [35]:
output = model(input.to(device))

In [36]:
output.shape

torch.Size([1024, 346, 21])

### define training procedure

In [172]:
def mac_train_model(model, criterion, optimizer, scheduler, num_epochs=18):
    since = time.time()
    
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 24)

        # Each epoch has a training and validation phase
        for phase in ['train', 'test']:
            if phase == 'train':
                scheduler.step()
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode; won't alter to different dropout and BatchNorm weights

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # turn on the tracking history in train and turn off the tracking history in others
                with torch.set_grad_enabled(phase == 'train'): #set gradient calculation enabled
                    outputs = model(inputs)
                    max_tensor, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.shape[0]
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_size[phase]
            epoch_acc = running_corrects.double()*100 / (dataset_size[phase]*max_french_sequence_length)

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'test' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model


## Train


In [176]:
model_1 = mac_train_model(model=model, criterion=criterion, optimizer=optimizer_ft, 
                          scheduler=exp_lr_scheduler, num_epochs=500)

Epoch 0/499
------------------------
train Loss: 2.0889 Acc: 57.4691
test Loss: 2.0622 Acc: 57.8432

Epoch 1/499
------------------------
train Loss: 2.0400 Acc: 58.1501
test Loss: 1.9812 Acc: 59.1575

Epoch 2/499
------------------------
train Loss: 1.9831 Acc: 58.8583
test Loss: 2.0562 Acc: 57.2418

Epoch 3/499
------------------------
train Loss: 1.9702 Acc: 58.9069
test Loss: 2.2273 Acc: 57.6183

Epoch 4/499
------------------------
train Loss: 1.9737 Acc: 58.9849
test Loss: 2.0391 Acc: 57.3133

Epoch 5/499
------------------------
train Loss: 2.0185 Acc: 58.3969
test Loss: 1.9381 Acc: 59.8364

Epoch 6/499
------------------------
train Loss: 2.0330 Acc: 58.2666
test Loss: 2.1571 Acc: 55.8177

Epoch 7/499
------------------------
train Loss: 1.9774 Acc: 58.8879
test Loss: 1.9662 Acc: 58.1889

Epoch 8/499
------------------------
train Loss: 1.9759 Acc: 58.9838
test Loss: 2.0064 Acc: 57.4871

Epoch 9/499
------------------------
train Loss: 1.9860 Acc: 58.4675
test Loss: 2.0220 Acc:

test Loss: 1.7043 Acc: 64.3876

Epoch 81/499
------------------------
train Loss: 1.6790 Acc: 64.5796
test Loss: 1.6445 Acc: 64.4263

Epoch 82/499
------------------------
train Loss: 1.6678 Acc: 64.8705
test Loss: 1.6100 Acc: 65.6313

Epoch 83/499
------------------------
train Loss: 1.6621 Acc: 65.0356
test Loss: 1.6570 Acc: 65.8021

Epoch 84/499
------------------------
train Loss: 1.6636 Acc: 64.6202
test Loss: 1.6214 Acc: 65.2487

Epoch 85/499
------------------------
train Loss: 1.6341 Acc: 65.2080
test Loss: 1.5757 Acc: 66.3081

Epoch 86/499
------------------------
train Loss: 1.6386 Acc: 64.9863
test Loss: 1.6548 Acc: 65.7361

Epoch 87/499
------------------------
train Loss: 1.6705 Acc: 65.0242
test Loss: 1.6805 Acc: 65.3252

Epoch 88/499
------------------------
train Loss: 1.6093 Acc: 65.9225
test Loss: 1.6012 Acc: 65.5657

Epoch 89/499
------------------------
train Loss: 1.6431 Acc: 65.6630
test Loss: 1.6432 Acc: 64.9788

Epoch 90/499
------------------------
train Loss: 

test Loss: 1.4592 Acc: 69.2297

Epoch 161/499
------------------------
train Loss: 1.3506 Acc: 71.5027
test Loss: 1.3659 Acc: 70.9182

Epoch 162/499
------------------------
train Loss: 1.3663 Acc: 71.0012
test Loss: 1.3826 Acc: 71.2908

Epoch 163/499
------------------------
train Loss: 1.3846 Acc: 70.9308
test Loss: 1.3608 Acc: 70.9175

Epoch 164/499
------------------------
train Loss: 1.3349 Acc: 71.6664
test Loss: 1.3807 Acc: 71.1704

Epoch 165/499
------------------------
train Loss: 1.3723 Acc: 71.1272
test Loss: 1.3202 Acc: 72.0596

Epoch 166/499
------------------------
train Loss: 1.3700 Acc: 71.3790
test Loss: 1.4031 Acc: 70.7590

Epoch 167/499
------------------------
train Loss: 1.3574 Acc: 71.3713
test Loss: 1.3995 Acc: 70.4260

Epoch 168/499
------------------------
train Loss: 1.3970 Acc: 70.8956
test Loss: 1.3407 Acc: 71.4818

Epoch 169/499
------------------------
train Loss: 1.3312 Acc: 71.8952
test Loss: 1.2846 Acc: 72.2078

Epoch 170/499
------------------------
tr

train Loss: 1.1602 Acc: 75.9003
test Loss: 1.1540 Acc: 75.5943

Epoch 241/499
------------------------
train Loss: 1.1124 Acc: 76.2261
test Loss: 1.2055 Acc: 75.4874

Epoch 242/499
------------------------
train Loss: 1.1657 Acc: 75.6070
test Loss: 1.1741 Acc: 75.6200

Epoch 243/499
------------------------
train Loss: 1.1279 Acc: 76.2407
test Loss: 1.1618 Acc: 75.3632

Epoch 244/499
------------------------
train Loss: 1.1759 Acc: 75.5944
test Loss: 1.1279 Acc: 76.2972

Epoch 245/499
------------------------
train Loss: 1.1602 Acc: 76.1386
test Loss: 1.1310 Acc: 76.2635

Epoch 246/499
------------------------
train Loss: 1.1258 Acc: 76.3907
test Loss: 1.1176 Acc: 76.3432

Epoch 247/499
------------------------
train Loss: 1.1252 Acc: 76.1516
test Loss: 1.1124 Acc: 76.7896

Epoch 248/499
------------------------
train Loss: 1.1230 Acc: 76.3677
test Loss: 1.1665 Acc: 76.1584

Epoch 249/499
------------------------
train Loss: 1.1453 Acc: 75.9295
test Loss: 1.1658 Acc: 76.3103

Epoch 250

train Loss: 0.9973 Acc: 79.1367
test Loss: 0.9773 Acc: 79.5643

Epoch 321/499
------------------------
train Loss: 1.0000 Acc: 79.1834
test Loss: 1.0050 Acc: 79.0944

Epoch 322/499
------------------------
train Loss: 1.0105 Acc: 79.0826
test Loss: 0.9740 Acc: 79.8168

Epoch 323/499
------------------------
train Loss: 0.9956 Acc: 79.1006
test Loss: 1.0157 Acc: 78.7162

Epoch 324/499
------------------------
train Loss: 0.9810 Acc: 79.2752
test Loss: 1.0420 Acc: 78.5314

Epoch 325/499
------------------------
train Loss: 0.9695 Acc: 79.6125
test Loss: 1.0090 Acc: 78.9094

Epoch 326/499
------------------------
train Loss: 1.0054 Acc: 78.8679
test Loss: 1.0496 Acc: 78.2713

Epoch 327/499
------------------------
train Loss: 0.9762 Acc: 79.4816
test Loss: 1.0391 Acc: 79.2389

Epoch 328/499
------------------------
train Loss: 0.9672 Acc: 79.6895
test Loss: 0.9886 Acc: 79.4073

Epoch 329/499
------------------------
train Loss: 0.9789 Acc: 79.6063
test Loss: 0.9591 Acc: 80.0584

Epoch 330

train Loss: 0.9149 Acc: 80.7688
test Loss: 0.9016 Acc: 80.7156

Epoch 401/499
------------------------
train Loss: 0.8657 Acc: 81.7711
test Loss: 0.9026 Acc: 80.7389

Epoch 402/499
------------------------
train Loss: 0.9111 Acc: 81.1752
test Loss: 0.9398 Acc: 80.6061

Epoch 403/499
------------------------
train Loss: 0.8750 Acc: 81.7046
test Loss: 0.8719 Acc: 82.0301

Epoch 404/499
------------------------
train Loss: 0.8496 Acc: 82.1271
test Loss: 0.9680 Acc: 80.3693

Epoch 405/499
------------------------
train Loss: 0.8869 Acc: 81.3743
test Loss: 0.9020 Acc: 81.4923

Epoch 406/499
------------------------
train Loss: 0.8842 Acc: 81.7736
test Loss: 0.8972 Acc: 81.0831

Epoch 407/499
------------------------
train Loss: 0.8828 Acc: 81.6444
test Loss: 0.8864 Acc: 81.5185

Epoch 408/499
------------------------
train Loss: 0.8665 Acc: 81.7956
test Loss: 0.8814 Acc: 81.6023

Epoch 409/499
------------------------
train Loss: 0.8439 Acc: 82.2832
test Loss: 0.8623 Acc: 81.7384

Epoch 410

train Loss: 0.7861 Acc: 83.5862
test Loss: 0.7720 Acc: 83.7349

Epoch 481/499
------------------------
train Loss: 0.7883 Acc: 83.4352
test Loss: 0.8058 Acc: 83.2950

Epoch 482/499
------------------------
train Loss: 0.8085 Acc: 83.1249
test Loss: 0.8101 Acc: 83.1727

Epoch 483/499
------------------------
train Loss: 0.7798 Acc: 83.5566
test Loss: 0.7960 Acc: 83.9188

Epoch 484/499
------------------------
train Loss: 0.7778 Acc: 83.7545
test Loss: 0.7612 Acc: 84.0610

Epoch 485/499
------------------------
train Loss: 0.7950 Acc: 83.4373
test Loss: 0.8163 Acc: 83.1439

Epoch 486/499
------------------------
train Loss: 0.7789 Acc: 83.6991
test Loss: 0.7999 Acc: 83.5962

Epoch 487/499
------------------------
train Loss: 0.8110 Acc: 83.0916
test Loss: 0.8478 Acc: 82.2453

Epoch 488/499
------------------------
train Loss: 0.7724 Acc: 83.8021
test Loss: 0.8003 Acc: 83.2012

Epoch 489/499
------------------------
train Loss: 0.7953 Acc: 83.4539
test Loss: 0.8616 Acc: 82.2520

Epoch 490

## Validation

In [177]:
x_tk = english_tokenizer
y_tk = french_tokenizer
x_id_to_word = {value: key for key, value in x_tk.word_index.items()}
y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
x_id_to_word[0] = '<PAD>'
y_id_to_word[0] = '<PAD>'

In [178]:
len(x_id_to_word)

200

In [179]:
output.shape

torch.Size([1, 345, 21])

In [180]:
index = 0
pred_input = input[index].view(1,-1) 
#print(pred_input.shape)
pred_input = pred_input.numpy()
pred_input.reshape(-1)

output = model_1(input[index].view(1,-1).to(device))

pred_output = output.data.view(french_vocab_size+1, max_french_sequence_length).cpu().numpy()
#print(pred_output.shape)
pred_output[0]

target_output = label[index].numpy()

x_tk = english_tokenizer
y_tk = french_tokenizer
x_id_to_word = {value: key for key, value in x_tk.word_index.items()}
y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
x_id_to_word[0] = '<PAD>'
y_id_to_word[0] = '<PAD>'
x_id_to_word[pred_input.reshape(-1)[0]]
print(' '.join([x_id_to_word[x] for x in pred_input.reshape(-1)]))
print(' '.join([y_id_to_word[x] for x in np.argmax(pred_output, axis=0)]))
print('correct French sentence')
print(' '.join([y_id_to_word[x] for x in target_output]))


new jersey is warm during autumn and it is never rainy in december <PAD> <PAD>
new jersey est parfois chaud à automne et il est jamais en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
correct French sentence
new jersey est chaud pendant l' automne et il est jamais pluvieux en décembre <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>


### Generate the html

**Save your notebook before running the next cell to generate the HTML output.** Then submit your project.

In [None]:
# Save before you run this cell!
!!jupyter nbconvert *.ipynb

## Optional Enhancements

This project focuses on learning various network architectures for machine translation, but we don't evaluate the models according to best practices by splitting the data into separate test & training sets -- so the model accuracy is overstated. Use the [`sklearn.model_selection.train_test_split()`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to create separate training & test datasets, then retrain each of the models using only the training set and evaluate the prediction accuracy using the hold out test set. Does the "best" model change?