NLP From Scratch: Translation with a Sequence to Sequence Network and Attention
===============================================================================
[Original Notebook](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

In this project we will be teaching a neural network to translate from
French to English.


    data path : /NLP/data/eng-fra.txt
    ------------------------------
    [KEY: > input, = target, < output]

    > il est en train de peindre un tableau .
    = he is painting a picture .
    < he is painting a picture .

    > pourquoi ne pas essayer ce vin delicieux ?
    = why not try that delicious wine ?
    < why not try that delicious wine ?

    > elle n est pas poete mais romanciere .
    = she is not a poet but a novelist .
    < she not not a poet but a novelist .

    > vous etes trop maigre .
    = you re too skinny .
    < you re all alone .


An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/seq2seq.png)



Loading data files
==================

The file is a tab-separated list of translation pairs:

``` {.sourceCode .sh}
I am cold.    J'ai froid.
```
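
A minimal sketch of splitting one such line into a pair, assuming the English sentence comes first and the French sentence second, as in the sample above. The helper name `parse_pair` is hypothetical, and the `[:2]` slice is a defensive assumption in case a line carries extra tab-separated columns.

```python
# Hypothetical helper: split one line of the data file into (english, french).
def parse_pair(line):
    # Drop the trailing newline, split on tabs, keep the first two fields.
    english, french = line.rstrip("\n").split("\t")[:2]
    return english, french


sample_line = "I am cold.\tJ'ai froid.\n"
pair = parse_pair(sample_line)
print(pair)  # ('I am cold.', "J'ai froid.")
```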

### Word level One-Hot Embedding

Each word in a language is represented as a one-hot vector: a giant
vector of zeros except for a single one (at the index of the word).
Compared to the dozens of characters that might exist in a language,
there are many more words, so the encoding vector is much larger. We
will, however, cheat a bit and trim the data to only use a few thousand
words per language.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/word-encoding.png)
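
As an illustration of the idea, here is a small plain-Python sketch of a word-level one-hot vector. The five-word vocabulary and the `one_hot` helper are made up for the example; in practice the model looks words up by index in an embedding layer rather than materializing these vectors.

```python
def one_hot(index, vocab_size):
    # A vector of zeros with a single one at the word's index.
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


# Toy vocabulary (illustrative only); indices 0 and 1 are reserved for SOS/EOS.
word2index = {"SOS": 0, "EOS": 1, "i": 2, "am": 3, "cold": 4}
vector = one_hot(word2index["am"], len(word2index))
print(vector)  # [0, 0, 0, 1, 0]
```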



The `Lang` helper class keeps three dictionaries:

    word → index (word2index)
    index → word (index2word)
    word → count (word2count), used to replace rare words later


dataset :  

    input_lang - fra ,
    output_lang - eng,
    pairs - ['je suis toujours tres nerveux', 'i m always very nervous']

Each sentence is normalized before it is added to the vocabulary. The
Unicode string is first turned into plain ASCII (thanks to
https://stackoverflow.com/a/518232/2809427), then lowercased, trimmed,
and stripped of non-letter characters:

    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)
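
The normalization steps above could be sketched as follows. The two regular expressions are the ones quoted in these notes; the function names, the accent-stripping approach (decompose with NFD and drop combining marks), and the final `.strip()` are assumptions of this sketch.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    # Decompose accented characters and drop the combining marks ("Mn").
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )


def normalize_string(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)   # collapse everything else to a space
    return s.strip()


print(normalize_string("Vous êtes trop maigre."))  # vous etes trop maigre
```

Note that with this second pattern a period is not in the allowed set, so it is replaced by a space and then stripped, while `!` and `?` survive as separate tokens.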

The `Lang` class, together with the length and prefix filter applied to
each pair:

    SOS_token = 0
    EOS_token = 1

    class Lang:
        def __init__(self, name):
            self.name = name
            self.word2index = {}
            self.word2count = {}
            self.index2word = {0: "SOS", 1: "EOS"}
            self.n_words = 2  # Count SOS and EOS

        def addSentence(self, sentence):
            for word in sentence.split(' '):
                self.addWord(word)

        def addWord(self, word):
            if word not in self.word2index:
                self.word2index[word] = self.n_words
                self.word2count[word] = 1
                self.index2word[self.n_words] = word
                self.n_words += 1
            else:
                self.word2count[word] += 1


    MAX_LENGTH = 10

    eng_prefixes = (
        "i am ", "i m ",
        "he is", "he s ",
        "she is", "she s ",
        "you are", "you re ",
        "we are", "we re ",
        "they are", "they re "
    )

    def filterPair(p):
        return len(p[0].split(' ')) < MAX_LENGTH and \
            len(p[1].split(' ')) < MAX_LENGTH and \
            p[1].startswith(eng_prefixes)


    def filterPairs(pairs):
        return [pair for pair in pairs if filterPair(pair)]
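
A short usage sketch of the filter and the vocabulary bookkeeping. To stay self-contained it restates the filter in condensed form and builds `word2index` with a plain dictionary instead of the `Lang` class; the two sample pairs and the snake_case names are illustrative assumptions.

```python
MAX_LENGTH = 10
ENG_PREFIXES = (
    "i am ", "i m ", "he is", "he s ", "she is", "she s ",
    "you are", "you re ", "we are", "we re ", "they are", "they re ",
)


def filter_pair(pair):
    # pair = [french, english]; keep short pairs whose English side
    # starts with one of the simple prefixes.
    return (len(pair[0].split(" ")) < MAX_LENGTH
            and len(pair[1].split(" ")) < MAX_LENGTH
            and pair[1].startswith(ENG_PREFIXES))


pairs = [
    ["je suis toujours tres nerveux", "i m always very nervous"],
    ["pourquoi ne pas essayer ce vin delicieux ?", "why not try that delicious wine ?"],
]
kept = [p for p in pairs if filter_pair(p)]

# Index the English words the same way Lang.addWord does:
# indices 0 and 1 are reserved for SOS and EOS.
word2index, n_words = {}, 2
for _, english in kept:
    for word in english.split(" "):
        if word not in word2index:
            word2index[word] = n_words
            n_words += 1

print(len(kept))   # 1
print(word2index)  # {'i': 2, 'm': 3, 'always': 4, 'very': 5, 'nervous': 6}
```

The second pair is dropped only because its English side does not start with one of the prefixes; the length test alone would have kept it.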

The Seq2Seq Model
=================
The encoder reads an input
sequence and outputs a single vector, and the decoder reads that vector
to produce an output sequence.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/seq2seq.png)

Unlike sequence prediction with a single RNN, where every input
corresponds to an output, the seq2seq model frees us from sequence
length and order, which makes it ideal for translation between two
languages.



With a seq2seq model the encoder creates a single vector which, in the
ideal case, encodes the "meaning" of the input sequence: a single point
in some N-dimensional space of sentences.


The Encoder
===========

The encoder of a seq2seq network is an RNN that outputs some value for
every word from the input sentence.

    EncoderRNN(
      (embedding): Embedding(input_size, hidden_size)
      (gru): GRU(hidden_size, hidden_size, batch_first=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )

For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.

    input_seq (input to Encoder): torch.Size([batch_size, seq_length])
    Embedded: torch.Size([batch_size, seq_length, hidden_size])
    Encoder output: torch.Size([batch_size, seq_length, hidden_size])
    Encoder hidden state: torch.Size([1, batch_size, hidden_size])


    input_size = Size of the input vocabulary
    hidden_size = Size of the hidden state
    seq_length = Length of the input sequence (padded to MAX_LENGTH)
    

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/encoder-network.png)
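
A minimal sketch of an encoder matching the module layout and shapes listed above. The layer names and sizes follow the printout; applying dropout directly to the embedding is an assumption of this sketch, and the concrete sizes are illustrative only.

```python
import torch
import torch.nn as nn


class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super().__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input_seq):
        # [batch, seq] -> [batch, seq, hidden]
        embedded = self.dropout(self.embedding(input_seq))
        # output: [batch, seq, hidden]; hidden: [1, batch, hidden]
        output, hidden = self.gru(embedded)
        return output, hidden


# Illustrative sizes, not the ones used for the real data.
batch_size, seq_length, input_size, hidden_size = 4, 10, 50, 16
encoder = EncoderRNN(input_size, hidden_size)
input_seq = torch.randint(0, input_size, (batch_size, seq_length))
encoder_outputs, encoder_hidden = encoder(input_seq)
print(encoder_outputs.shape)  # torch.Size([4, 10, 16])
print(encoder_hidden.shape)   # torch.Size([1, 4, 16])
```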


The Decoder
===========

The decoder is another RNN that takes the encoder output vector(s) and
outputs a sequence of words to create the translation.

**Encoder output / context vector** : used as the initial hidden state of the decoder.

    DecoderRNN(
      (embedding): Embedding(output_size, hidden_size)
      (gru): GRU(hidden_size, hidden_size, batch_first=True)
      (out): Linear(in_features=hidden_size, out_features=output_size, bias=True)
    )

At every step of decoding, the decoder is given an input token and hidden state.

    Decoder Input(Initial):  torch.Size([batch_size, 1])
    Decoder Hidden(Initial):  torch.Size([1, batch_size, hidden_size])

The initial input token is the start-of-string `<SOS>` token, and the first hidden state is the context vector (the encoder's last hidden state).

    Forward-Step: runs MAX_LENGTH times in total
    decoder_outputs, decoder_hidden, _ = decoder(encoder_outputs, encoder_hidden)

    Input(Forward-Step):  torch.Size([batch_size, 1])
    Output(Forward-Step : Embedding):  torch.Size([batch_size, 1, hidden_size])
    Output(Forward-Step : GRU):  torch.Size([batch_size, 1, hidden_size])
    Output(Forward-Step : Hidden):  torch.Size([1, batch_size, hidden_size])
    Output(Forward-Step : Output):  torch.Size([batch_size, 1, output_size])
    Decoder Output(each word):  torch.Size([batch_size, 1, output_size])
    Decoder Hidden(each word):  torch.Size([1, batch_size, hidden_size])
    Decoder Input (without teacher forcing):  torch.Size([batch_size, 1])

If we also pass a target tensor:

    target_tensor = torch.randint(0, output_size, (batch_size , MAX_LENGTH)).to(device)
    decoder_outputs, decoder_hidden, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

the decoder uses teacher forcing, and the next input is read from the target instead of the prediction:

    Decoder Input (teacher forcing):  torch.Size([batch_size, 1])


**Teacher forcing** is a training technique for sequence-to-sequence models.

During training, the model is fed the **actual target outputs** instead of its own predictions.

This helps **stabilize training** and prevent error propagation.

In inference, the model uses its **own predictions** from the previous time step.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/decoder-network.png)


**Why a custom forward method**

The decoder's forward function is more complex because it operates in a conditional manner. It takes the context vector, the previous output (during training, this is the target sequence; during inference, this is the model's own predictions), and the previous hidden state to generate the next output. This process of generating outputs conditionally based on previous outputs requires a separate forward function in the decoder.
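
The conditional loop described above might look like the following sketch. The module layout and the `(outputs, hidden, None)` return signature follow these notes; the ReLU on the embedding, the `argmax` used for greedy selection, and the illustrative sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SOS_token = 0
MAX_LENGTH = 10


class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        # Every sequence starts from the SOS token: [batch, 1]
        decoder_input = torch.full((batch_size, 1), SOS_token, dtype=torch.long,
                                   device=encoder_outputs.device)
        decoder_hidden = encoder_hidden  # context vector: [1, batch, hidden]
        step_outputs = []
        for i in range(MAX_LENGTH):
            step_output, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)
            step_outputs.append(step_output)
            if target_tensor is not None:
                # Teacher forcing: feed the true target token.
                decoder_input = target_tensor[:, i].unsqueeze(1)
            else:
                # Free running: feed the model's own best guess.
                decoder_input = step_output.argmax(dim=-1).detach()
        decoder_outputs = F.log_softmax(torch.cat(step_outputs, dim=1), dim=-1)
        return decoder_outputs, decoder_hidden, None

    def forward_step(self, decoder_input, hidden):
        embedded = F.relu(self.embedding(decoder_input))  # [batch, 1, hidden]
        output, hidden = self.gru(embedded, hidden)       # [batch, 1, hidden]
        return self.out(output), hidden                   # [batch, 1, output_size]


batch_size, hidden_size, output_size = 4, 16, 30
decoder = DecoderRNN(hidden_size, output_size)
encoder_outputs = torch.randn(batch_size, MAX_LENGTH, hidden_size)
encoder_hidden = torch.randn(1, batch_size, hidden_size)
target_tensor = torch.randint(0, output_size, (batch_size, MAX_LENGTH))

free_running, _, _ = decoder(encoder_outputs, encoder_hidden)
teacher_forced, hidden_out, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)
print(teacher_forced.shape)  # torch.Size([4, 10, 30])
```

Both call styles return the same shapes; only the source of the next input token differs.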

Attention Decoder
=================

![](https://i.imgur.com/1152PYf.png)

The attention weights are calculated by a small feed-forward scoring
network (the `Wa`, `Ua` and `Va` layers of `BahdanauAttention` below),
using the decoder's hidden state as the query and the encoder outputs as
the keys. Because there are sentences of all sizes in the training data,
the encoder outputs are padded to a maximum sentence length
(`MAX_LENGTH`), so every decoding step produces `MAX_LENGTH` attention
weights. Sentences of the maximum length will use all the attention
weights, while for shorter sentences only the first few correspond to
real words.

![](https://pytorch.org/tutorials/_static/img/seq-seq-images/attention-decoder-network.png)




    AttnDecoderRNN(
      (embedding): Embedding(output_size, hidden_size)
      (attention): BahdanauAttention(
        (Wa): Linear(in_features=hidden_size, out_features=hidden_size, bias=True)
        (Ua): Linear(in_features=hidden_size, out_features=hidden_size, bias=True)
        (Va): Linear(in_features=hidden_size, out_features=1, bias=True)
      )
      (gru): GRU(2 * hidden_size, hidden_size, batch_first=True)
      (out): Linear(in_features=hidden_size, out_features=output_size, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )

Forward

    Encoder Output(Initial):  torch.Size([batch_size, MAX_LENGTH, hidden_size])
    Decoder Input(Initial):  torch.Size([batch_size, 1])
    Decoder Hidden(Initial):  torch.Size([1, batch_size, hidden_size])

Forward-Step : each word up to MAX_LENGTH

    Embedded Shape(Forward-Step): torch.Size([batch_size, 1, hidden_size])
    Scores Shape: torch.Size([batch_size, MAX_LENGTH, 1])
    Scores re-Shape: torch.Size([batch_size, 1, MAX_LENGTH])
    Weights Shape: torch.Size([batch_size, 1, MAX_LENGTH])
    Context Shape: torch.Size([batch_size, 1, hidden_size])
    Query Shape(Forward-Step): torch.Size([batch_size, 1, hidden_size])
    Context Shape(Forward-Step): torch.Size([batch_size, 1, hidden_size])
    Input GRU Shape(Forward-Step): torch.Size([batch_size, 1, 2 * hidden_size])
    Output(Forward-Step : GRU):  torch.Size([batch_size, 1, hidden_size])
    Output(Forward-Step : Hidden):  torch.Size([1, batch_size, hidden_size])
    Output(Forward-Step : Output):  torch.Size([batch_size, 1, output_size])
    Decoder Output(each word):  torch.Size([batch_size, 1, output_size])
    Decoder Hidden(each word):  torch.Size([1, batch_size, hidden_size])
    Attention Weights(each word):  torch.Size([batch_size, 1, MAX_LENGTH])
    Decoder Input (teacher forcing):  torch.Size([batch_size, 1])
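
The score, weight, and context shapes listed above can be reproduced with a sketch of additive (Bahdanau-style) attention. The layer names `Wa`, `Ua`, `Va` come from the module printout; the exact forward computation, the use of `permute` to build the query from the decoder hidden state, and the sizes are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        # query: [batch, 1, hidden]; keys: [batch, MAX_LENGTH, hidden]
        # The sum broadcasts the query across all key positions.
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))  # [batch, MAX_LENGTH, 1]
        scores = scores.transpose(1, 2)                                # [batch, 1, MAX_LENGTH]
        weights = F.softmax(scores, dim=-1)                            # sums to 1 over positions
        context = torch.bmm(weights, keys)                             # [batch, 1, hidden]
        return context, weights


batch_size, max_length, hidden_size = 4, 10, 16
attention = BahdanauAttention(hidden_size)
encoder_outputs = torch.randn(batch_size, max_length, hidden_size)
decoder_hidden = torch.randn(1, batch_size, hidden_size)
embedded = torch.randn(batch_size, 1, hidden_size)  # stand-in for the embedded input token

query = decoder_hidden.permute(1, 0, 2)             # [batch, 1, hidden]
context, weights = attention(query, encoder_outputs)
gru_input = torch.cat((embedded, context), dim=2)   # [batch, 1, 2 * hidden]
print(weights.shape)    # torch.Size([4, 1, 10])
print(gru_input.shape)  # torch.Size([4, 1, 32])
```

Concatenating the embedded token with the context vector is what makes the GRU input `2 * hidden_size` wide.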


[Local Attention](https://arxiv.org/abs/1508.04025) works around the length limitation by using a relative position approach.

Training
========

Preparing Training Data
-----------------------

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and target tensor (indexes of the words in
the target sentence). While creating these vectors we will append the
EOS token to both sequences.


indexesFromSentence(lang, sentence) : list with one index per word

    for each word of the sentence ---> return list of lang.word2index[word]

tensorFromSentence(lang, sentence) : torch.Size([1, len(indexes) + 1])

    append EOS_token to the list from indexesFromSentence(..) --> convert to tensor

tensorsFromPair(pair) : (input_tensor, target_tensor)

    build each tensor using tensorFromSentence(..) on device

get_dataloader(batch_size) : input_lang, output_lang, train_dataloader

    get input_lang, output_lang, pairs using prepareData(..) -->
    input_ids & target_ids : zero-filled arrays of shape (len(pairs), MAX_LENGTH)
    for idx, (inp, tgt) in enumerate(pairs):
        inp_ids = indexesFromSentence(input_lang, inp)
        inp_ids.append(EOS_token)
        input_ids[idx, :len(inp_ids)] = inp_ids    (same for tgt_ids and target_ids)
    train_data = TensorDataset(input_ids, target_ids as tensors on device)
    train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
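
The per-sentence part of this pipeline (look up indices, append `EOS_token`, right-pad to `MAX_LENGTH`) can be sketched in plain Python. The helper names and the toy vocabulary are hypothetical; the sketch uses lists where the real code would fill a preallocated array and wrap it in a `TensorDataset`.

```python
EOS_token = 1
MAX_LENGTH = 10


def indexes_from_sentence(word2index, sentence):
    # One vocabulary index per space-separated word.
    return [word2index[word] for word in sentence.split(" ")]


def padded_ids(word2index, sentence, max_length=MAX_LENGTH):
    ids = indexes_from_sentence(word2index, sentence)
    ids.append(EOS_token)
    row = [0] * max_length    # zero-filled row, like one row of input_ids
    row[:len(ids)] = ids      # assumes len(ids) <= max_length (guaranteed by the pair filter)
    return row


# Toy vocabulary (illustrative only).
word2index = {"i": 2, "m": 3, "always": 4, "very": 5, "nervous": 6}
row = padded_ids(word2index, "i m always very nervous")
print(row)  # [2, 3, 4, 5, 6, 1, 0, 0, 0, 0]
```

One thing worth noticing is that the padding value 0 is the same integer as `SOS_token`, so padded positions are not distinguishable from SOS by value alone.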


Training the Model
==================
Teacher-forced networks can produce outputs that read with coherent
grammar but wander far from the correct translation. Intuitively, the
network has learned to represent the output grammar and can "pick up"
the meaning once the teacher tells it the first few words, but it has
not properly learned how to create the sentence from the translation in
the first place.

Because of the freedom PyTorch's autograd gives us, we can randomly
choose to use teacher forcing or not with a simple if statement. Turn
`teacher_forcing_ratio` up to use more of it.
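
A minimal sketch of that "simple if statement", in plain Python so it runs without a model. The helper `next_decoder_input` and the default ratio of 0.5 are assumptions for illustration; in a training loop the two tokens would be tensors taken from the target and from the decoder's prediction.

```python
import random

teacher_forcing_ratio = 0.5  # illustrative default


def next_decoder_input(target_token, predicted_token, ratio=teacher_forcing_ratio):
    # random.random() is in [0, 1), so ratio=1.0 always teacher-forces
    # and ratio=0.0 never does.
    use_teacher_forcing = random.random() < ratio
    return target_token if use_teacher_forcing else predicted_token


always_target = next_decoder_input(7, 3, ratio=1.0)
always_prediction = next_decoder_input(7, 3, ratio=0.0)
print(always_target, always_prediction)  # 7 3
```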

    one epoch = one forward & backward pass over the entire dataset; the loss is reported per epoch

    train_epoch(dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion):
        for input_tensor, target_tensor in dataloader:
            call zero_grad() on both optimizers
            encoder_outputs, encoder_hidden = encoder(input_tensor)
            decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)
            compute the loss with criterion, call loss.backward(), then step() both optimizers


    Input:  torch.Size([batch_size, MAX_LENGTH])
    Target: torch.Size([batch_size, MAX_LENGTH])
    Embedded: torch.Size([batch_size, MAX_LENGTH, hidden_size])
    output & hidden: torch.Size([batch_size, MAX_LENGTH, hidden_size]), torch.Size([1, batch_size, hidden_size])
    Encoder Output/Context Vector:  torch.Size([batch_size, MAX_LENGTH, hidden_size])
    Encoder Hidden:  torch.Size([1, batch_size, hidden_size])
    Decoder Output: torch.Size([batch_size, MAX_LENGTH, output_lang.n_words])

    input_lang.n_words = 4601, output_lang.n_words = 2991

  loss function : criterion = nn.NLLLoss()

  NLLLoss measures the negative log likelihood of the actual target word under the predicted distribution over words. It expects log-probabilities as input, so the decoder output should hold log-softmax values rather than raw scores or plain probabilities.

    Decoder Output: torch.Size([batch_size * MAX_LENGTH, output_lang.n_words]) - each row contains the log-probabilities of every word in the vocabulary for a given timestep and batch instance.
    Target : torch.Size([batch_size * MAX_LENGTH]) - each element is the index of the target word for a given timestep and batch instance.
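
To make the flattening concrete, here is a small sketch with random stand-in data. The tensor names mirror the ones used in these notes, the sizes are illustrative, and the `manual` line is only there to show what `nn.NLLLoss` averages by default.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, max_length, vocab_size = 2, 3, 5

# Stand-ins for the decoder output (log-probabilities) and the target indices.
decoder_outputs = F.log_softmax(torch.randn(batch_size, max_length, vocab_size), dim=-1)
target_tensor = torch.randint(0, vocab_size, (batch_size, max_length))

criterion = nn.NLLLoss()
flat_outputs = decoder_outputs.view(-1, decoder_outputs.size(-1))  # [batch * MAX_LENGTH, vocab]
flat_targets = target_tensor.view(-1)                              # [batch * MAX_LENGTH]
loss = criterion(flat_outputs, flat_targets)

# Same value by hand: minus the log-probability of each target word, averaged.
manual = -flat_outputs.gather(1, flat_targets.unsqueeze(1)).mean()
print(torch.allclose(loss, manual))  # True
```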

Training and Evaluating
=======================

    input_lang, output_lang, train_dataloader = get_dataloader(batch_size)

    encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
    decoder = AttnDecoderRNN(hidden_size, output_lang.n_words).to(device)


Train

    Input:  torch.Size([batch_size, MAX_LENGTH])
    Target: torch.Size([batch_size, MAX_LENGTH])
    Embedded: torch.Size([batch_size, MAX_LENGTH, hidden_size])
    output & hidden: torch.Size([batch_size, MAX_LENGTH, hidden_size]), torch.Size([1, batch_size, hidden_size])
    Encoder Output/Context Vector:  torch.Size([batch_size, MAX_LENGTH, hidden_size])
    Encoder Hidden:  torch.Size([1, batch_size, hidden_size])
    Decoder Output: torch.Size([batch_size, MAX_LENGTH, output_lang.n_words])

    input_lang.n_words = 4601, output_lang.n_words = 2991
    


Evaluation

    encoder_outputs, encoder_hidden = encoder(input_tensor)
    decoder_outputs, decoder_hidden, decoder_attn = decoder(encoder_outputs, encoder_hidden)

    _, topi = decoder_outputs.topk(1)
    decoded_ids = topi.squeeze()
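
The `topk` step above only yields word indices; turning them into words and stopping at `EOS` might look like this sketch. The tiny `index2word` table and the hand-built `decoder_outputs` tensor are made up for the example, and `squeeze()` is used as in the snippet above, which assumes a single sentence (batch size 1).

```python
import torch

EOS_token = 1
index2word = {0: "SOS", 1: "EOS", 2: "i", 3: "m", 4: "cold"}  # toy vocabulary

# Fake log-probabilities for one sentence: [1, 4 steps, 5 words].
# The best word per step is set by hand to 2, 3, 4, then EOS.
decoder_outputs = torch.full((1, 4, 5), -10.0)
for step, word_id in enumerate([2, 3, 4, 1]):
    decoder_outputs[0, step, word_id] = 0.0

_, topi = decoder_outputs.topk(1)  # best index per step: [1, 4, 1]
decoded_ids = topi.squeeze()       # [4]

decoded_words = []
for idx in decoded_ids:
    if idx.item() == EOS_token:
        decoded_words.append("<EOS>")
        break                      # stop at the end-of-sentence token
    decoded_words.append(index2word[idx.item()])

print(decoded_words)  # ['i', 'm', 'cold', '<EOS>']
```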


Before evaluating, switch the dropout layers to eval mode:

    encoder.eval()
    decoder.eval()

    > tu es celui que j attendais
    = you re the one i ve been waiting for
    Embedded (Encoder): torch.Size([1, len(sentence), hidden_size])
    output & hidden (Encoder): torch.Size([1, len(sentence), hidden_size]), torch.Size([1, 1, hidden_size])
    < i m not a good to be <EOS>
