# Pairwise translation model
The translation pairs covered here are:
- Brazilian Portuguese <-> Spanish
- Brazilian Portuguese <-> English
- Italian <-> Latin
- Italian <-> Spanish
- Italian <-> English

## Libraries

In [None]:
import nltk
from nltk.translate.bleu_score import corpus_bleu
nltk.download('punkt')

import torch
from torch import optim
from torch.utils.data import DataLoader

from IPython.display import Image

In [None]:
!pip3 install sentencepiece
!pip3 install transformers
!pip3 install translate-toolkit

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from translate.storage.tmx import tmxfile

## Corpora
We use corpora taken from [OPUS - an open source parallel corpus](https://opus.nlpl.eu/). They are all stored in the 'Dados' folder.

In [None]:
Image(filename = 'Imagens/proto-indo-eu.jpg')

First, we read the corpus.

In [None]:
file_path = 'Dados/'

def read_corpus(filename, language_1, language_2):
    '''
    Read corpus function.

    Params:
    - filename (string): name of the file, e.g. 'en-pt_br.tmx'
    - language_1 (string): abbreviation of the first language, e.g. 'en'
    - language_2 (string): abbreviation of the second language, e.g. 'pt'

    Return:
    - f_output: parsed TMX store with language_1 as the source of translation and language_2 as the target
    '''

    with open(file_path + filename, 'rb') as f_input:
        f_output = tmxfile(f_input, language_1, language_2)

    return f_output
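
For orientation, a TMX file is XML in which each translation unit (`<tu>`) holds one `<tuv>` element per language. The sketch below parses a tiny hand-written TMX string using only the standard library. The sample sentences and the helper name `read_tmx_pairs` are illustrative assumptions, not part of the OPUS data or of `translate-toolkit`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the TMX layout (illustrative, not real corpus data).
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="pt_br"><seg>Bom dia.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="pt_br"><seg>Obrigado.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, language_1, language_2):
    """Hypothetical helper: return (language_1, language_2) sentence pairs."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code in this unit to its segment text.
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if language_1 in segs and language_2 in segs:
            pairs.append((segs[language_1], segs[language_2]))
    return pairs


pairs_en_pt = read_tmx_pairs(SAMPLE_TMX, "en", "pt_br")
pairs_pt_en = read_tmx_pairs(SAMPLE_TMX, "pt_br", "en")
```

Swapping the two language arguments swaps source and target, which is how the same file can serve both translation directions.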

Next, we convert it into the format required for translation.

In [None]:
def prepare_data(prefix_lang, filename, language_1, language_2):
    '''
    Format the files correctly for the translation

    Params:
    - prefix_lang: prefix for the target language, e.g. '>>pt_br<<'
    - filename (string): name of the file, e.g. 'en-pt_br.tmx'
    - language_1 (string): abbreviation of the first language, e.g. 'en'
    - language_2 (string): abbreviation of the second language, e.g. 'pt'
    
    Return:
    - data: data formatted for the translation from language_1 to language_2
    '''

    file = read_corpus(filename, language_1, language_2)

    data = [
        { 'src': prefix_lang + ' ' + w.source, 'trg': w.target }
        for w in file.unit_iter()
    ]

    # print("Total sentences in the file: " + str(len(data)))

    return data
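
As a small illustration of the resulting format, the sketch below builds the same `src`/`trg` dictionaries from plain sentence pairs. The helper name `format_pairs` and the sample sentences are assumptions made for this example; the point is only that the target-language prefix is prepended to the source text, while the target is left untouched.

```python
def format_pairs(prefix_lang, pairs):
    """Hypothetical helper mirroring the list comprehension in prepare_data."""
    return [
        {'src': prefix_lang + ' ' + source, 'trg': target}
        for source, target in pairs
    ]


# Illustrative sentence pairs (not taken from the corpus).
sample_pairs = [
    ("Good morning.", "Bom dia."),
    ("Thank you.", "Obrigado."),
]

sample_data = format_pairs('>>pt_br<<', sample_pairs)

for item in sample_data:
    print(item['src'], '->', item['trg'])
```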

Then, we split it into training and test sets.

In [None]:
def train_test(prefix_lang, filename, language_1, language_2):
    '''
    Split the data in train and test by 80/20

    Params:
    - prefix_lang: prefix for the target language, e.g. '>>pt_br<<'
    - filename (string): name of the file, e.g. 'en-pt_br.tmx'
    - language_1 (string): abbreviation of the first language, e.g. 'en'
    - language_2 (string): abbreviation of the second language, e.g. 'pt'

    Return:
    - train: last 80% of occurrences in data
    - test: first 20% of occurrences in data
    '''

    data = prepare_data(prefix_lang, filename, language_1, language_2)

    size = int(len(data) * 0.2)

    train = data[size:]
    test = data[:size]

    # print(train[0])
    # print(test[0])

    return train, test
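
To make the slicing concrete, here is a minimal sketch of the same split applied to ten placeholder items. The function name `split_first_fraction` is an assumption for this example. Note that the split is positional, with no shuffling: the first 20% of the items become the test set and the rest become the training set.

```python
def split_first_fraction(data, test_fraction=0.2):
    """Hypothetical helper: the first `test_fraction` of items go to test."""
    size = int(len(data) * test_fraction)
    train = data[size:]
    test = data[:size]
    return train, test


# Ten placeholder items standing in for sentence pairs.
items = list(range(10))
train_items, test_items = split_first_fraction(items)

print('train:', train_items)
print('test:', test_items)
```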

## Training

### Hyperparameters

In [None]:
LEARNING_RATE = 1e-5 
EPOCHS = 1
BATCH_SIZE = 16
BATCH_STATUS = 32
EARLY_STOP = 3
TOKEN_MAX_LENGTH = 128
NUM_BEAMS = 4
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Prefixes

In [None]:
PREFIX_PT_BR = '>>pt_br<<'
PREFIX_LATIN = '>>la<<'
PREFIX_SPANISH = '>>es<<'
PREFIX_ITALIAN = '>>it<<'
PREFIX_ENGLISH = '>>en<<'

We start training by splitting the data into batches.

In [None]:
def batch_train_test(prefix_lang, filename, language_1, language_2):
    '''
    Put the data in batches

    Params:
    - prefix_lang: prefix for the target language, e.g. '>>pt_br<<'
    - filename (string): name of the file, e.g. 'en-pt_br.tmx'
    - language_1 (string): abbreviation of the first language, e.g. 'en'
    - language_2 (string): abbreviation of the second language, e.g. 'pt'

    Return:
    - train_data: train data in batches
    - test_data: test data in batches
    '''

    train, test = train_test(prefix_lang, filename, language_1, language_2)

    train_data = DataLoader(train, batch_size = BATCH_SIZE)
    test_data = DataLoader(test, batch_size = BATCH_SIZE)

    return train_data, test_data
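
When `DataLoader` receives a list of dictionaries with string values, its default collation groups each batch into a single dictionary of lists, which is why the later code indexes `inp['src']` and `inp['trg']`. The standard-library sketch below imitates that behaviour without PyTorch. The generator name `iter_batches` and the sample records are assumptions for illustration, and this is not the actual `DataLoader` implementation.

```python
def iter_batches(records, batch_size):
    """Hypothetical stand-in for DataLoader over a list of dicts of strings."""
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        # Turn a list of dicts into one dict of lists, key by key.
        yield {key: [record[key] for record in chunk] for key in chunk[0]}


# Illustrative records in the src/trg format used in this notebook.
records = [
    {'src': '>>pt_br<< Good morning.', 'trg': 'Bom dia.'},
    {'src': '>>pt_br<< Thank you.', 'trg': 'Obrigado.'},
    {'src': '>>pt_br<< Good night.', 'trg': 'Boa noite.'},
]

batches = list(iter_batches(records, batch_size=2))

for batch in batches:
    print(batch['src'], batch['trg'])
```

The last batch is smaller than `batch_size` whenever the number of records is not a multiple of it.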

We then evaluate our model using the [BLEU (BiLingual Evaluation Understudy) score](https://cloud.google.com/translate/automl/docs/evaluate).

In [None]:
def evaluate(prefix_lang, filename, language_1, language_2, model, tokenizer):
    '''
    Evaluate the model

    Params:
    - prefix_lang: prefix for the target language, e.g. '>>pt_br<<'
    - filename (string): name of the file, e.g. 'en-pt_br.tmx'
    - language_1 (string): abbreviation of the first language, e.g. 'en'
    - language_2 (string): abbreviation of the second language, e.g. 'pt'
    - model: neural network model
    - tokenizer: tokenizer, can be pre-trained

    Return:
    - bleu: BLEU score
    '''

    # Evaluate the model
    model.eval()
    
    y_real = []
    y_pred = []

    _, test_data = batch_train_test(prefix_lang, filename, language_1, language_2)
    
    for batch_idx, inp in enumerate(test_data):
        y_real.extend(inp['trg'])
        
        # tokenization
        model_inputs = tokenizer(
            inp['src'], 
            truncation = True, 
            padding = True, 
            max_length = TOKEN_MAX_LENGTH, 
            return_tensors = "pt"
        ).to(DEVICE)
        
        # Translation
        generated_ids = model.generate(**model_inputs, num_beams = NUM_BEAMS)
        
        # Translation post processing
        output = tokenizer.batch_decode(generated_ids, skip_special_tokens = True)
        y_pred.extend(output)
    
        # Print results
        if (batch_idx + 1) % BATCH_STATUS == 0:
            print(
                'Evaluation: [{}/{} ({:.0f}%)]'.format(
                    batch_idx + 1, len(test_data), 100. * (batch_idx + 1) / len(test_data)
                )
            )

    # Calculate BLEU score
    hyps, refs = [], []
    
    for i, snt_pred in enumerate(y_pred):
        hyps.append(nltk.word_tokenize(snt_pred))
        refs.append([nltk.word_tokenize(y_real[i])])
    
    bleu = corpus_bleu(refs, hyps)

    return bleu
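
To show what the BLEU computation measures, the following is a simplified standard-library sketch of corpus-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes exactly one reference per hypothesis, whitespace tokenization, and no smoothing. The function names are hypothetical, and this is not NLTK's implementation, so scores may differ from `corpus_bleu` in edge cases.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def corpus_bleu_sketch(references, hypotheses, max_n=4):
    """Simplified corpus BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for ref, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than their references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = "the cat sat on the mat today".split()
exact = corpus_bleu_sketch([reference], [reference])
shorter = corpus_bleu_sketch([reference], ["the cat sat on the mat".split()])
unrelated = corpus_bleu_sketch([reference], ["a dog ran far away".split()])
print(exact, shorter, unrelated)
```

In the second case every n-gram of the hypothesis appears in the reference, so the score is lowered only by the brevity penalty.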

Finally, we build our training loop.

In [None]:
def train(prefix_lang, filename, language_1, language_2, model, tokenizer, optimizer):
    '''
    Training loop

    Params:
    - prefix_lang: prefix for the target language, e.g. '>>pt_br<<'
    - filename (string): name of the file, e.g. 'en-pt_br.tmx'
    - language_1 (string): abbreviation of the first language, e.g. 'en'
    - language_2 (string): abbreviation of the second language, e.g. 'pt'
    - model: neural network model
    - tokenizer: tokenizer, can be pre-trained
    - optimizer: neural network optimizer
    '''

    train_data, test_data = batch_train_test(prefix_lang, filename, language_1, language_2)
    
    # Calculate initial BLEU score
    max_bleu = evaluate(prefix_lang, filename, language_1, language_2, model, tokenizer)
    print('Initial BLEU score:', max_bleu)
    
    # Train model
    model.train()
    repeat = 0
    
    for epoch in range(EPOCHS):
        losses = []

        for batch_idx, inp in enumerate(train_data):
            # Reset the gradients to zero before each step
            optimizer.zero_grad()

            # Tokenization
            model_inputs = tokenizer(
                inp['src'], 
                truncation = True,
                padding = True, 
                max_length = TOKEN_MAX_LENGTH, 
                return_tensors = "pt"
            ).to(DEVICE)
            
            with tokenizer.as_target_tokenizer():
                labels = tokenizer(
                    inp['trg'], 
                    truncation = True, 
                    padding = True, 
                    max_length = TOKEN_MAX_LENGTH, 
                    return_tensors = "pt"
                ).input_ids.to(DEVICE)
            
            # Translation and Forward pass
            output = model(**model_inputs, labels = labels)

            # Calculate loss
            loss = output.loss
            losses.append(float(loss))

            # Backpropagation
            loss.backward()
            optimizer.step()

            # Print results
            if (batch_idx + 1) % BATCH_STATUS == 0:
                print(
                    'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tTotal Loss: {:.6f}'.format(
                        epoch + 1, batch_idx + 1, len(train_data), 100. * (batch_idx + 1) / len(train_data), 
                        float(loss), round(sum(losses)/ len(losses), 5)
                    )
                )

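        # Calculate epoch BLEU score
        bleu = evaluate(prefix_lang, filename, language_1, language_2, model, tokenizer)
        print('BLEU:', bleu)

        # evaluate() switches the model to eval mode, so restore training mode
        model.train()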
        
        if bleu > max_bleu:
            max_bleu = bleu
            repeat = 0

            # print('Saving best model...')
            # torch.save(model, write_path)
        else:
            repeat += 1

        if repeat == EARLY_STOP:
            break
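
The early-stopping rule at the end of each epoch can be sketched in isolation. The function below replays the same counter logic over a list of made-up per-epoch BLEU scores; the name `epochs_run` and the scores are assumptions for illustration, not results from this notebook.

```python
def epochs_run(bleu_per_epoch, initial_bleu, early_stop=3):
    """Hypothetical helper: return (epochs completed, best BLEU seen)."""
    max_bleu = initial_bleu
    repeat = 0

    for epoch, bleu in enumerate(bleu_per_epoch, start=1):
        if bleu > max_bleu:
            # Improvement: remember it and reset the patience counter.
            max_bleu = bleu
            repeat = 0
        else:
            repeat += 1

        # Stop after `early_stop` consecutive epochs without improvement.
        if repeat == early_stop:
            return epoch, max_bleu

    return len(bleu_per_epoch), max_bleu


# Made-up scores: one improvement, then three epochs without one.
stopped = epochs_run([0.30, 0.29, 0.28, 0.27, 0.40], initial_bleu=0.25)
# Made-up scores that improve every epoch, so training runs to the end.
finished = epochs_run([0.30, 0.31, 0.32], initial_bleu=0.25)
print(stopped, finished)
```

With `EPOCHS = 1` and `EARLY_STOP = 3` as set above, the counter can never reach the threshold, so early stopping only matters once the number of epochs is raised.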

## Model
For more details, see [the MarianMT documentation](https://huggingface.co/docs/transformers/model_doc/marian).

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE").to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE")
optimizer = optim.AdamW(model.parameters(), lr = LEARNING_RATE)

## Experiments

In [None]:
def experiment(batch_input_sent, model, tokenizer):
    '''
    Given a batch of sentences, return their translations

    Params:
    - batch_input_sent: input sentences to translate
    - model: neural network model
    - tokenizer: tokenizer, can be pre-trained

    Return:
    - list of translated sentences
    '''

    # Tokenize sentences
    encoded = tokenizer(batch_input_sent, return_tensors = 'pt', padding = True).to(DEVICE)

    # Translation
    translated = model.generate(**encoded)

    # Decode the output; returning it lets the notebook display the result
    return tokenizer.batch_decode(translated, skip_special_tokens = True)

### Brazilian Portuguese <-> Spanish

#### Brazilian Portuguese -> Spanish

In [None]:
train(PREFIX_SPANISH, 'es-pt_BR.tmx', 'pt_br', 'es', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>es<< Será que isso vai funcionar?"),
    (">>es<< Teste número 2. Você consegue traduzir isso que eu sei!"),
    (">>es<< Acho que eu preciso deixar você rodando por mais tempo, né?"),
    (">>es<< Eu sei que eu deveria ser mais criativo nos meus testes, mas quero traduzir de português brasileiro para espanhol, mesmo que o BLEU seja baixo.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

#### Spanish -> Brazilian Portuguese

In [None]:
train(PREFIX_PT_BR, 'es-pt_BR.tmx', 'es', 'pt_br', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>pt_br<< Buenos días."),
    (">>pt_br<< No sé cómo resultará esto."),
    (">>pt_br<< Por favor trabaja.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

### Brazilian Portuguese <-> English

#### Brazilian Portuguese -> English

In [None]:
train(PREFIX_ENGLISH, 'en-pt_br.tmx', 'pt_br', 'en', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>en<< Será que isso vai funcionar?"),
    (">>en<< Teste número 2. Você consegue traduzir isso que eu sei!"),
    (">>en<< Acho que eu preciso deixar você rodando por mais tempo, né?"),
    (">>en<< Eu sei que eu deveria ser mais criativo nos meus testes, mas não acredito que consegui traduzir de português brasileiro para inglês, mesmo com um BLEU tão baixo.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

#### English -> Brazilian Portuguese

In [None]:
train(PREFIX_PT_BR, 'en-pt_br.tmx', 'en', 'pt_br', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>pt_br<< Please, don't fail me now."), 
    (">>pt_br<< Who is a good translator? You are!"), 
    (">>pt_br<< I hope you are able to translate a big sentence, because people nowadays love texting. And I want to present this to my teacher and colleagues, so you have to work!"),
    (">>pt_br<< I really don't want to study tonight but I have to do it because I want to graduate and get a job and have a lot of money.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

### Italian <-> Latin

#### Italian -> Latin

In [None]:
train(PREFIX_LATIN, 'it-la.tmx', 'it', 'la', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>la<< Allora molti morirono combattendo per la libertà."),
    (">>la<< Gli alunni ascoltavano i maestri per imparare molte cose.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

#### Latin -> Italian

In [None]:
train(PREFIX_ITALIAN, 'it-la.tmx', 'la', 'it', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>it<< Libertatis causa pugnantes multi tum ceciderunt."),
    (">>it<< Discipuli magistros audiebant ut multa discerent.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

### Italian <-> Spanish

#### Italian -> Spanish

In [None]:
train(PREFIX_SPANISH, 'es-it.tmx', 'it', 'es', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>es<< Oggi devo studiare ancora per due ore."),
    (">>es<< Mi piace disegnare paesaggi di montagna."),
    (">>es<< Il mio computer non funziona tanto bene.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

#### Spanish -> Italian

In [None]:
train(PREFIX_ITALIAN, 'es-it.tmx', 'es', 'it', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>it<< Buenos días."),
    (">>it<< No sé cómo resultará esto."),
    (">>it<< Por favor trabaja.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

### Italian <-> English

#### Italian -> English

In [None]:
train(PREFIX_ENGLISH, 'en-it.tmx', 'it', 'en', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>en<< Oggi devo studiare ancora per due ore."),
    (">>en<< Mi piace disegnare paesaggi di montagna."),
    (">>en<< Il mio computer non funziona tanto bene.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)

#### English -> Italian

In [None]:
train(PREFIX_ITALIAN, 'en-it.tmx', 'en', 'it', model, tokenizer, optimizer)

In [None]:
batch_input_sent = (
    (">>it<< Please work. I'm exhausted already."),
    (">>it<< Now I just need one more sentence so I can finally save this file."),
    (">>it<< And I couldn't be creative, even at the very end.")
)

In [None]:
experiment(batch_input_sent, model, tokenizer)