<a href="https://colab.research.google.com/github/JhonathanOrtiz/NLP/blob/master/Seq2Seq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



> # <strong>  Introduction </strong>

This application uses a Seq2Seq model to map input sentences, each stating a math word problem, to the corresponding equation.

Text can be treated as sequential data, much like a time series, because the order of characters and words carries meaning. Recurrent Neural Networks (RNNs) suit this kind of data because they carry information from earlier steps forward in a hidden state. [Here you can go deeper](https://towardsdatascience.com/understanding-rnn-and-lstm-f7cdf6dfc14e)

Models that work with text come in several flavors: *Many2One* models predict a single characteristic from a sequence, *One2Many* models generate a sequence from a single characteristic, and *Many2Many* models predict one sequence from another. We focus on *Many2Many*, since we want to predict one sequence (the equation) from another (the sentence).





In [None]:
!python -m spacy download es

✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/es_core_news_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/es
You can now load the model via spacy.load('es')


In [None]:
%cd /content/drive/My Drive

/content/drive/My Drive


In [None]:
import pandas as pd
import spacy
from torchtext.data import Field, BucketIterator,TabularDataset
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch import nn
import torch
from torch import optim
import re
import numpy as np
import random
from torch.utils.data.dataset import random_split
from torch.utils.tensorboard import SummaryWriter

In [None]:
esp = spacy.load('es')



> # Tokenization

Tokenizing converts a text into a list of words or characters.

In this first step we define two functions. The first tokenizes the input sentences by word, since they are ordinary sentences. The second tokenizes the target equations by character, so that an equation can be predicted one character at a time.

In [None]:
def tokenizer(text):
  # Word-level tokens for the input sentences, lowercased; single-space tokens are dropped.
  return [tok.text.lower() for tok in esp.tokenizer(text) if tok.text != " "]

def split_label(text):
  # Character-level tokens for the target equations, skipping
  # parentheses, square brackets and single quotes.
  return [char for char in text if char not in "()[]'"]
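
To see the difference between the two tokenization levels without loading spaCy, here is a minimal sketch. The regex-based `word_tokenize` is a hypothetical stand-in for the spaCy tokenizer (it does not reproduce spaCy's rules), and the sample strings are made up for illustration.

```python
import re

def word_tokenize(text):
    # Hypothetical stand-in for the spaCy tokenizer: lowercase the text, then
    # split it into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def char_tokenize(text):
    # Same idea as split_label: one token per character, skipping
    # parentheses, square brackets and single quotes.
    return [char for char in text if char not in "()[]'"]

sentence_tokens = word_tokenize("Henry tiene 4 chicles.")
equation_tokens = char_tokenize("['x=4+4']")

print(sentence_tokens)  # ['henry', 'tiene', '4', 'chicles', '.']
print(equation_tokens)  # ['x', '=', '4', '+', '4']
```

Word-level tokens keep the input vocabulary meaningful, while character-level targets keep the output vocabulary tiny (digits, operators and a few symbols).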



> # Dataset

Our dataset consists of school-style math word problems stored in a .csv file that I preprocessed beforehand. One column (`Input`) holds the word problem and the other (`Target`) holds its equation. The file is read with [torchtext](https://torchtext.readthedocs.io/en/latest/data.html); check the documentation for details.


In [None]:
SRC = Field(tokenize = tokenizer, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = split_label, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

field = {'Input': ('SRC', SRC), 'Target': ('TRG', TRG)}

In [None]:
# Note: train and test both read 'dataframe.csv', so test_data is not a held-out set.
train_data, test_data = TabularDataset.splits(path='/content/drive/My Drive',
                                             train = 'dataframe.csv',
                                             test = 'dataframe.csv',
                                             format='csv',
                                             fields= field)

In [None]:
SRC.build_vocab(train_data, min_freq = 1)
TRG.build_vocab(train_data, min_freq = 1)
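
As a rough picture of what building a vocabulary involves, the sketch below counts tokens, keeps those seen at least `min_freq` times, and creates the two lookup tables (`stoi`, `itos`) used later in the notebook. The helper names `build_vocab` and `numericalize`, the ordering rule and the toy examples are assumptions for illustration; this is not torchtext's implementation.

```python
from collections import Counter

def build_vocab(tokenized_examples, min_freq=1,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    # Count every token across all examples.
    counts = Counter(tok for example in tokenized_examples for tok in example)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically so the result is deterministic).
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq and tok not in specials),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + kept                       # index -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # token -> index
    return stoi, itos

def numericalize(tokens, stoi):
    # Unknown tokens fall back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

examples = [["x", "=", "4", "+", "4"], ["x", "=", "7", "-", "2"]]
stoi, itos = build_vocab(examples)

print(len(itos))                             # 11
print(numericalize(["x", "=", "9"], stoi))   # [6, 5, 0]
```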



>  # Build the model

<h2>Encoder</h2>

Our Seq2Seq model follows an Encoder-Decoder structure; this section covers the encoder. The first step in the encoder is an *Embedding* layer, which maps each input token from its n-dimensional one-hot representation (n being the vocabulary size) to a dense vector whose dimension we choose.

Next, the embedded sequence passes through a GRU (Gated Recurrent Unit) module, a gated recurrent layer closely related to the LSTM; for more information click [here](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21). The GRU returns an output and a hidden state and, unlike an LSTM, does not return a cell state. The hidden state is a fixed-size vector and will be the input to the decoder.

So far: the input passes through an embedding layer (so that words get non-equidistant representations, unlike one-hot vectors), and the resulting dense vectors feed the recurrent layer, which maps a variable-length input to a fixed-size vector. We could say the model learns a compressed representation of the input.
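
The idea that an embedding layer maps a one-hot vector to a dense vector can be checked by hand: multiplying a one-hot vector by the embedding table selects a single row of it. The sketch below uses plain Python lists; the table values, sizes and helper names are arbitrary assumptions for illustration.

```python
import random

def one_hot(index, vocab_size):
    # n-dimensional vector with a single 1 at the token's index.
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def vector_times_matrix(vector, matrix):
    # vector of length V times a V x D matrix gives a vector of length D.
    return [
        sum(vector[v] * matrix[v][d] for v in range(len(matrix)))
        for d in range(len(matrix[0]))
    ]

rng = random.Random(0)
VOCAB_SIZE, EMB_DIM = 6, 3
# Randomly initialised table; in a real model these values are learned.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)]
                   for _ in range(VOCAB_SIZE)]

token_index = 4
via_one_hot = vector_times_matrix(one_hot(token_index, VOCAB_SIZE), embedding_table)
via_lookup = embedding_table[token_index]

# Both routes give the same dense vector; the lookup just skips the multiplication.
print(all(abs(a - b) < 1e-12 for a, b in zip(via_one_hot, via_lookup)))  # True
```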


In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim, hid_dim) #single-layer GRU, so no dropout argument is passed to it
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded) #no cell state!
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden
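
To make the "hidden state but no cell state" point concrete, here is a scalar GRU step written in plain Python, following the gate equations documented for PyTorch's `nn.GRU` (reset gate `r`, update gate `z`, candidate `n`). The parameter values are hand-picked assumptions, and a real layer uses weight matrices rather than scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    # Reset gate: how much of the previous hidden state feeds the candidate.
    r = sigmoid(p["w_ir"] * x + p["w_hr"] * h + p["b_r"])
    # Update gate: how much of the previous hidden state is kept.
    z = sigmoid(p["w_iz"] * x + p["w_hz"] * h + p["b_z"])
    # Candidate hidden state.
    n = math.tanh(p["w_in"] * x + p["b_in"] + r * (p["w_hn"] * h + p["b_hn"]))
    # New hidden state; this single value is all the cell carries forward.
    return (1.0 - z) * n + z * h

def encode(sequence, p, h0=0.0):
    # Run the cell over the whole sequence and return only the final hidden state.
    h = h0
    for x in sequence:
        h = gru_step(x, h, p)
    return h

params = {"w_ir": 0.5, "w_hr": -0.3, "b_r": 0.1,
          "w_iz": 0.8, "w_hz": 0.2, "b_z": -0.1,
          "w_in": 1.2, "w_hn": 0.7, "b_in": 0.0, "b_hn": 0.05}

short_context = encode([0.2, -0.4], params)
long_context = encode([0.2, -0.4, 0.9, 0.1, -0.7], params)
print(short_context, long_context)  # one number each, whatever the input length
```

Whatever the sequence length, the encoder hands the decoder a summary of the same size, which is the role `hidden` plays in the `Encoder` class.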

<h1>Decoder</h1>

We have fed a sequence of token indices representing a sentence through the encoder, which turned it first into dense vectors and then into a single fixed-size vector that summarizes the input. Now we have to transform that vector into another sequence; that is the job of Seq2Seq. In our case the target sequences are equations, which is why we tokenized the target by character: at each step the model must predict one character.

The decoder has an Embedding layer that does the same job as the encoder's. At the first step its input is the start token `<sos>`, which we defined earlier to mark the beginning of each sequence; at each following step its input is the character just predicted, and so on iteratively. That is why its input dimension must be the size of the target vocabulary.

It also has a GRU layer and a Fully-Connected layer whose output has one score per entry of the target vocabulary, indicating how likely each character is to come next. In the code below, the encoder's context vector is additionally fed to the GRU and to the Fully-Connected layer at every step.


In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.output_dim = output_dim
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim + hid_dim, hid_dim)
        
        self.fc_out = nn.Linear(emb_dim + hid_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, context):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #context = [n layers * n directions, batch size, hid dim]
        
        #n layers and n directions in the decoder will both always be 1, therefore:
        #hidden = [1, batch size, hid dim]
        #context = [1, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        emb_con = torch.cat((embedded, context), dim = 2)
            
        #emb_con = [1, batch size, emb dim + hid dim]
            
        output, hidden = self.rnn(emb_con, hidden)
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        
        #seq len, n layers and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        output = torch.cat((embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), 
                           dim = 1)
        
        #output = [batch size, emb dim + hid dim * 2]
        
        prediction = self.fc_out(output)
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden
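
The Fully-Connected layer returns raw scores (logits), one per target token. A softmax turns them into probabilities, and since softmax preserves ordering, taking the `argmax` of the logits picks the same token as taking the `argmax` of the probabilities. A minimal sketch with made-up logits:

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5, 3.0]   # hypothetical scores for a 4-token vocabulary
probs = softmax(logits)

best_from_logits = max(range(len(logits)), key=lambda i: logits[i])
best_from_probs = max(range(len(probs)), key=lambda i: probs[i])

print(best_from_logits, best_from_probs)  # 3 3
```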



> # Put all together

The general forward pass of our model runs the encoder once and then makes predictions iteratively with the decoder. Until when? During training, for as many steps as the target sequence has; at inference, until the model predicts the end-of-sequence token `<eos>` or a maximum length is reached. At the first decoder iteration the input is the start token `<sos>`; at each following iteration the input is the output we have just predicted.

There is a small policy called Teacher Forcing: with probability `teacher_forcing_ratio` we use the actual next token from the target as the next input; otherwise we use the predicted token.



In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is the context
        hidden = self.encoder(src)
        
        #context also used as the initial hidden state of the decoder
             
        context = hidden
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and the context state
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, context)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs
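
The teacher forcing decision can be isolated from the network itself. In the sketch below, `toy_decoder` is a hypothetical stand-in that always "predicts" the previous token plus one, which makes it easy to see which inputs the loop actually feeds at each step. The integer tokens are made up for illustration.

```python
import random

def toy_decoder(prev_token):
    # Hypothetical stand-in for the real decoder: predicts prev_token + 1.
    return prev_token + 1

def decode_with_teacher_forcing(target, teacher_forcing_ratio, rng):
    inputs_used, predictions = [], []
    current_input = target[0]            # the <sos> position
    for t in range(1, len(target)):
        inputs_used.append(current_input)
        predicted = toy_decoder(current_input)
        predictions.append(predicted)
        # Same rule as in Seq2Seq.forward: draw a number in [0, 1) and compare.
        teacher_force = rng.random() < teacher_forcing_ratio
        current_input = target[t] if teacher_force else predicted
    return inputs_used, predictions

target = [0, 10, 20, 30]

always_inputs, _ = decode_with_teacher_forcing(target, 1.0, random.Random(0))
never_inputs, _ = decode_with_teacher_forcing(target, 0.0, random.Random(0))

print(always_inputs)  # [0, 10, 20]  ground-truth tokens are fed back
print(never_inputs)   # [0, 1, 2]    the model's own predictions are fed back
```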

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
#Train phase
BATCH_SIZE = 128

train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

In [None]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_DROPOUT)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(enc, dec, device).to(device)

In [None]:
def init_weights(m):
  for name, param in m.named_parameters():
    nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(1555, 256)
    (rnn): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(21, 256)
    (rnn): GRU(768, 512)
    (fc_out): Linear(in_features=1280, out_features=21, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [None]:
def solve_problem(model, SRC, TRG, sentence, device, max_length=50):

  model.eval()

  if isinstance(sentence, str):
    # reuse the training tokenizer so inference matches the vocabulary
    tokens = tokenizer(sentence)
  else:
    tokens = [token.lower() for token in sentence]

  tokens.insert(0, SRC.init_token)
  tokens.append(SRC.eos_token)

  text_to_indices = [SRC.vocab.stoi[token] for token in tokens]

  sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

  with torch.no_grad():
    hidden = model.encoder(sentence_tensor)
    context = hidden

  outputs = [TRG.vocab.stoi["<sos>"]]

  for _ in range(max_length):
    previous_word = torch.LongTensor([outputs[-1]]).to(device)

    with torch.no_grad():
      output, hidden = model.decoder(previous_word, hidden, context)
      best_guess = output.argmax(1).item()

    outputs.append(best_guess)

    # Model predicts it's the end of the sentence
    if best_guess == TRG.vocab.stoi["<eos>"]:
      break

  translated_sentence = [TRG.vocab.itos[idx] for idx in outputs]

  # remove start token
  return translated_sentence[1:]
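
The decoding loop in `solve_problem` is a greedy search: take the most likely token, feed it back, and stop at `<eos>` or after `max_length` steps. The sketch below shows only that control flow; `next_token_fn` is a hypothetical stand-in for the decoder call, and the transition table is made up.

```python
def greedy_decode(next_token_fn, sos, eos, max_length=50):
    outputs = [sos]
    for _ in range(max_length):
        best_guess = next_token_fn(outputs[-1])
        outputs.append(best_guess)
        # Stop as soon as the end-of-sequence token is produced.
        if best_guess == eos:
            break
    return outputs[1:]  # drop the start token

# Hypothetical "decoder" that deterministically spells out a short equation.
transitions = {"<sos>": "x", "x": "=", "=": "8", "8": "<eos>"}
finished = greedy_decode(transitions.get, "<sos>", "<eos>")

# A "decoder" that never emits <eos> is cut off by max_length instead.
runaway = greedy_decode(lambda tok: "1", "<sos>", "<eos>", max_length=5)

print(finished)  # ['x', '=', '8', '<eos>']
print(runaway)   # ['1', '1', '1', '1', '1']
```

The second case is why `max_length` matters: a model that has not yet learned to emit `<eos>` would otherwise loop forever.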

In [None]:
optimizer = torch.optim.Adam(model.parameters())
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)
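
Passing `ignore_index` tells the loss to skip padded positions, so padding does not influence the average. A plain-Python sketch of that behavior, assuming mean reduction and at least one non-padding target; the logits, targets and `PAD_IDX` value are made up for illustration:

```python
import math

def cross_entropy_ignore(logits_batch, targets, ignore_index):
    total, counted = 0.0, 0
    for logits, target in zip(logits_batch, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_sum_exp - logits[target]  # negative log-probability of the target
        counted += 1
    # Average over the positions that were actually counted.
    return total / counted

PAD_IDX = 1  # assumed padding index for this toy example
logits_batch = [[0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0],
                [9.0, -9.0, 3.0]]   # this row belongs to a padded position
targets = [0, 2, PAD_IDX]

loss = cross_entropy_ignore(logits_batch, targets, PAD_IDX)
print(abs(loss - math.log(3)) < 1e-9)  # True: only the two uniform rows count
```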

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.SRC
        trg = batch.TRG
        
        optimizer.zero_grad()
        output = model(src, trg)
 
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
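
The `clip_grad_norm_` call in `train` rescales all gradients together when their overall L2 norm exceeds `clip`, which guards against exploding gradients in recurrent networks. A minimal sketch of that rule on a flat list of numbers (the real function works on parameter tensors in place; the small `eps` mirrors the constant PyTorch adds to the denominator, and the gradient values are made up):

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    # Overall L2 norm across every gradient value.
    total_norm = math.sqrt(sum(g * g for g in grads))
    clip_coef = max_norm / (total_norm + eps)
    # Only shrink; gradients already within the limit are left untouched.
    if clip_coef < 1.0:
        grads = [g * clip_coef for g in grads]
    return grads, total_norm

large, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
small, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)

print(norm_before)                          # 5.0
print(math.sqrt(sum(g * g for g in large)))  # just under 1.0
print(small)                                 # [0.3, 0.4]
```

The direction of the gradient is preserved; only its length is capped.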

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
import time
import math

N_EPOCHS = 20
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    sentence = "Bridget tiene 4 chicles. Henry tiene 4 chicles. Si Henry le da todos sus chicles a Bridget, ¿cuántos chicles tendrá Bridget?"
    solve = solve_problem(model, SRC, TRG, sentence, device)
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)  
  
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print('Sentece {}, Result {}'.format(sentence, solve))


Epoch: 01 | Time: 0m 19s
	Train Loss: 2.661 | Train PPL:  14.305
Sentece Bridget tiene 4 chicles. Henry tiene 4 chicles. Si Henry le da todos sus chicles a Bridget, ¿cuántos chicles tendrá Bridget?, Result ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']
Epoch: 02 | Time: 0m 19s
	Train Loss: 2.084 | Train PPL:   8.035
Sentece Bridget tiene 4 chicles. Henry tiene 4 chicles. Si Henry le da todos sus chicles a Bridget, ¿cuántos chicles tendrá Bridget?, Result ['x', '=', '.', '0', '.', '0', '0', '.', '0', '0', '<eos>']
Epoch: 03 | Time: 0m 19s
	Train Loss: 1.845 | Train PPL:   6.327
Sentece Bridget tiene 4 chicles. Henry tiene 4 chicles. Si Henry le da todos sus chicles a Bridget, ¿cuántos chicles tendrá Bridget?, Result ['x', '=', '.', '0', '0', '0', '<eos>']
Epoch: 04 | Time: 0m 18s
	Train