---
# Trabalho 1 - Tópicos Especiais em Matemática Aplicada

**Alunos/Matricula:** João V. Farias & Renan V. Guedes / 221022604 & 221031363

**Arquitetura Usada:** Encoder-Decoder

**Dataset Link:** V1: [D-Talk](https://www.tensorflow.org/datasets/catalog/ted_hrlr_translate#ted_hrlr_translatefr_to_pt) from TensorFlow.Datasets

---

### **Projeto para traduzir mensagens do Francês para o Portugues**  

Neste projeto, vamos explorar e comparar três arquiteturas de redes neurais para tradução automática do francês para o português, usando o *dataset* TED Talks do *Open Translation Project*. A ideia é testar modelos do tipo **Encoder-Decoder**, analisando suas diferenças e impacto na qualidade da tradução.  

Os três modelos que vamos treinar são:  

1. **LSTM (Long Short-Term Memory)**  
   - Um modelo básico de rede recorrente bidirecional. O **Encoder** processa a frase em francês e gera um contexto, enquanto o **Decoder** usa esse contexto para formar a tradução em português.  
   - A principal vantagem desse modelo é sua capacidade de lidar com dependências de longo prazo nas sequências.  

2. **LSTM com Mecanismos de Atenção**  
   - Uma versão aprimorada do modelo anterior, adicionando camadas de atenção (produto escalar, Bahdanau e Luong).  
   - A atenção ajuda o modelo a "olhar" para partes específicas da frase de entrada enquanto traduz, melhorando a coerência e precisão.  

3. **Transformers**  
   - Uma abordagem mais moderna, baseada em **autoatenção**, eliminando o uso de redes recorrentes.  
   - Trabalha com processamento paralelo, usando *Multi-Head Attention* e *Positional Encoding* para entender relações entre palavras, mesmo quando estão distantes na frase.  

**Como vamos testar os modelos?**  
- **Dataset**: Vamos usar cerca de 52.000 pares de frases (francês-português) para treinar, além de 1.200 para validação e 1.800 para teste.  
- **Pré-processamento**: Faremos a tokenização com *SubwordTextEncoder* para reduzir palavras fora do vocabulário (*out-of-vocabulary* – OOV).  
- **Treinamento**: Otimização com Adam, acompanhando a perda (*loss*) e a acurácia durante o processo.  
- **Avaliação**: Vamos comparar os resultados usando a métrica BLEU e analisar exemplos práticos das traduções.  

**O que esperamos encontrar?**  
- Nosso objetivo é entender qual desses modelos tem o melhor equilíbrio entre qualidade de tradução e eficiência computacional.  
- É provável que os Transformers tenham um desempenho superior, já que conseguem processar frases de forma mais eficiente, enquanto os modelos com LSTM e atenção devem mostrar um avanço significativo sobre a versão básica de LSTM.  

No fim das contas, essa análise pode ajudar a compreender melhor como diferentes abordagens de deep learning se saem em tarefas de tradução, trazendo insights úteis para aplicações reais em NLP.

---

## 📚 Importando as bibliotecas necessárias

In [23]:
# Primeiro, vamos importar todas as bibliotecas que vamos precisar ao longo do Projeto
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import os

# Para processamento de texto
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Componentes do Keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, MultiHeadAttention, Concatenate, TimeDistributed
from tensorflow.keras.optimizers import Adam

# Para visualização dos resultados
from sklearn.metrics import confusion_matrix

tf.config.optimizer.set_jit(True)  # Ativa o XLA JIT compilation
tf.keras.mixed_precision.set_global_policy('mixed_float16') # Força o TensorFlow a usar precisão mista

print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.18.0


## 🔍 Carregando e Preparando os Dados

In [24]:
examples, metadata = tfds.load('ted_hrlr_translate/fr_to_pt', with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']


 Dataset carregado e preprocessado com sucesso! 🎉


In [None]:
def preprocess_text(text):
    text = tf.strings.regex_replace(text, r"([?.!,¿])", r" \1 ")
    text = tf.strings.regex_replace(text, r'[" "]+', " ")
    text = tf.strings.strip(text)
    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text

In [None]:
def prepare_dataset(examples, max_samples=None):
    fr_texts, pt_texts = [], []
    for fr, pt in examples:
        fr_texts.append(preprocess_text(fr).numpy().decode('utf-8'))
        pt_texts.append(preprocess_text(pt).numpy().decode('utf-8'))
        if max_samples and len(fr_texts) >= max_samples:
            break
    return fr_texts, pt_texts

In [None]:
fr_texts, pt_texts = prepare_dataset(train_examples, max_samples=50000)

In [None]:
val_fr_texts, val_pt_texts = prepare_dataset(val_examples, max_samples=10000)

## 🛠️ Configurando os Tokenizers

In [27]:
def tokenize_and_pad(fr_texts, pt_texts, max_input_length, max_target_length):
    tokenizer_fr = Tokenizer(filters='', oov_token='[OOV]')
    tokenizer_fr.fit_on_texts(fr_texts)
    input_vocab_size = len(tokenizer_fr.word_index) + 1

    tokenizer_pt = Tokenizer(filters='', oov_token='[OOV]')
    tokenizer_pt.fit_on_texts(pt_texts)
    target_vocab_size = len(tokenizer_pt.word_index) + 1

    # Convert texts to sequences
    fr_sequences = tokenizer_fr.texts_to_sequences(fr_texts)
    pt_sequences = tokenizer_pt.texts_to_sequences(pt_texts)

    # Prepare decoder inputs and outputs
    decoder_inputs = [seq[:-1] for seq in pt_sequences]
    decoder_outputs = [seq[1:] for seq in pt_sequences]

    # Pad sequences
    encoder_inputs = pad_sequences(fr_sequences, maxlen=max_input_length, padding='post')
    decoder_inputs = pad_sequences(decoder_inputs, maxlen=(max_target_length - 1), padding='post')
    decoder_outputs = pad_sequences(decoder_outputs, maxlen=(max_target_length - 1), padding='post')

    return encoder_inputs, decoder_inputs, decoder_outputs, input_vocab_size, target_vocab_size


In [25]:
max_input_length = 50
max_target_length = 50

In [28]:
train_encoder_inputs, train_decoder_inputs, train_decoder_outputs, input_vocab_size, target_vocab_size = tokenize_and_pad(
    fr_texts, pt_texts, max_input_length, max_target_length
)

In [None]:
val_encoder_inputs, val_decoder_inputs, val_decoder_outputs, _, _ = tokenize_and_pad(
    val_fr_texts, val_pt_texts, max_input_length, max_target_length
)

In [77]:
#tokenizer_pt.word_index.get('[START]')

In [62]:
#train_fr_sequences = tokenizer_fr.texts_to_sequences(fr_texts)
#train_pt_sequences = tokenizer_pt.texts_to_sequences(pt_texts)

In [63]:
# Seleciona algumas amostras do dataset de treino pré-processado
#for i, (fr, pt) in enumerate(train_dataset.take(3)):
#    print(f"\nAmostra {i+1}:")
#    print("Francês:  ", fr.numpy().decode('utf-8'))
#    print("Português:", pt.numpy().decode('utf-8'))
#
#val_fr_texts, val_pt_texts = [], []
#for fr, pt in val_dataset.take(MAX_SAMPLES):
#    val_fr_texts.append(fr.numpy().decode('utf-8'))
#    val_pt_texts.append(pt.numpy().decode('utf-8'))
#
#val_fr_sequences = tokenizer_fr.texts_to_sequences(val_fr_texts)
#val_pt_sequences = tokenizer_pt.texts_to_sequences(val_pt_texts)


In [64]:
#max_input_length = max(len(seq) for seq in train_fr_sequences)
#max_target_length = max(len(seq) for seq in train_pt_sequences)

In [65]:
#train_decoder_inputs = [seq[:-1] for seq in train_pt_sequences]
#train_decoder_outputs = [seq[1:] for seq in train_pt_sequences]
#
#val_decoder_inputs = [seq[:-1] for seq in val_pt_sequences]
#val_decoder_outputs = [seq[1:] for seq in val_pt_sequences]

In [66]:
# Sequências pad
#train_encoder_inputs = pad_sequences(train_fr_sequences, maxlen=max_input_length, padding='post')
#train_decoder_inputs = pad_sequences(train_decoder_inputs, maxlen=(max_target_length-1), padding='post')
#train_decoder_outputs = pad_sequences(train_decoder_outputs, maxlen=(max_target_length-1), padding='post')
#
#val_encoder_inputs = pad_sequences(val_fr_sequences, maxlen=max_input_length, padding='post')
#val_decoder_inputs = pad_sequences(val_decoder_inputs, maxlen=(max_target_length-1), padding='post')
#val_decoder_outputs = pad_sequences(val_decoder_outputs, maxlen=(max_target_length-1), padding='post')

# 🤖 Modelo 1: LSTM Básico
### Vamos começar com o modelo mais simples: um **Encoder-Decoder** usando **LSTM**

In [29]:
def build_lstm_model(input_vocab_size, target_vocab_size,
                     max_input_len, max_target_len,
                     latent_units=256, embedding_dim=256):
    # Encoder
    encoder_inputs = Input(shape=(max_input_len,), name="encoder_inputs")
    encoder_embed = Embedding(input_vocab_size, embedding_dim, name="encoder_embedding")(encoder_inputs)
    encoder_lstm = LSTM(latent_units, return_state=True, name="encoder_lstm")
    _, state_h, state_c = encoder_lstm(encoder_embed)

    # Decoder (teacher forcing during training)
    decoder_inputs = Input(shape=(max_target_len-1,), name="decoder_inputs")
    decoder_embed = Embedding(target_vocab_size, embedding_dim, name="decoder_embedding")(decoder_inputs)
    decoder_lstm = LSTM(latent_units, return_sequences=True, name="decoder_lstm")
    decoder_outputs = decoder_lstm(decoder_embed, initial_state=[state_h, state_c])
    decoder_dense = Dense(target_vocab_size, activation='softmax', dtype='float32', name="decoder_dense")
    decoder_outputs = decoder_dense(decoder_outputs)

    # Build the full model and attach useful layers for inference.
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.encoder_inputs = encoder_inputs
    model.encoder_embedding = model.get_layer("encoder_embedding")
    model.encoder_lstm = model.get_layer("encoder_lstm")
    model.decoder_embedding = model.get_layer("decoder_embedding")
    model.decoder_lstm = model.get_layer("decoder_lstm")
    model.decoder_dense = decoder_dense
    return model


## Criando e Compilando o Modelo

In [31]:
lstm_model = build_lstm_model(
    input_vocab_size=len(fr_tokenizer.word_index)+1,
    target_vocab_size=len(pt_tokenizer.word_index)+1,
    max_input_len=max_fr,
    max_target_len=max_pt
)

lstm_model.compile(
    optimizer=Adam(learning_rate=1e-4, clipnorm=1.0),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

lstm_model.summary()

## Treinando o Modelo

In [33]:
# Treinamento
history = lstm_model.fit(
    [train_enc, train_dec_in], train_dec_out,
    validation_data=prepare_datasets(fr_tokenizer, pt_tokenizer, val_dataset, MAX_SAMPLES)[0],
    #[train_encoder_inputs, train_decoder_inputs], train_decoder_outputs,
    #validation_data=([val_encoder_inputs, val_decoder_inputs], val_decoder_outputs),
    epochs=5,
    batch_size=10,
    validation_split=0.2,
    verbose=1
)

Epoch 1/5
[1m4388/4388[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step - accuracy: 0.8836 - loss: 1.8984

ValueError: Layer "functional_2" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'data:0' shape=(None, 109) dtype=int32>]

# 🤖 Modelo 2: LSTM (Luong)


In [34]:
class LuongAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        query = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W(query + values)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

In [35]:
def build_lstm_luong_attention_model(input_vocab_size, target_vocab_size, max_source_len, max_target_len,
                                    latent_units=256, embedding_dim=256):
    # Encoder
    encoder_inputs = Input(shape=(max_source_len,))
    enc_emb = Embedding(input_vocab_size, embedding_dim)(encoder_inputs)
    encoder_lstm = LSTM(latent_units, return_sequences=True, return_state=True)
    enc_outputs, state_h, state_c = encoder_lstm(enc_emb)

    # Decoder with Attention
    decoder_inputs = Input(shape=(max_target_len-1,))
    dec_emb = Embedding(target_vocab_size, embedding_dim)(decoder_inputs)
    decoder_lstm = LSTM(latent_units, return_sequences=True, return_state=True)
    dec_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

    attention = LuongAttention(latent_units)
    context_vector, _ = attention(dec_outputs, enc_outputs)
    dec_outputs = tf.concat([context_vector, dec_outputs], axis=-1)

    decoder_dense = Dense(target_vocab_size, activation='softmax', dtype='float32')
    outputs = decoder_dense(dec_outputs)

    return Model([encoder_inputs, decoder_inputs], outputs)

In [36]:
lstm_attn_model = build_lstm_luong_attention_model(
    input_vocab_size=len(fr_tokenizer.word_index)+1,
    target_vocab_size=len(pt_tokenizer.word_index)+1,
    max_source_len=max_fr,
    max_target_len=max_pt
)

# optimizer=Adam(learning_rate=0.001)
lstm_attn_model.compile(
    optimizer=Adam(),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

lstm_attn_model.summary()

1. The `call()` method of your layer may be crashing. Try to `__call__()` the layer eagerly on some test input first to see if it works. E.g. `x = np.random.random((3, 4)); y = layer(x)`
2. If the `call()` method is correct, then you may need to implement the `def build(self, input_shape)` method on your layer. It should create all variables used by the layer (e.g. by calling `layer.build()` on all its children layers).
Exception encountered: ''Dimensions must be equal, but are 154 and 151 for '{{node add}} = AddV2[T=DT_HALF](ExpandDims, Placeholder_1)' with input shapes: [?,1,154,256], [?,151,256].''


ValueError: Exception encountered when calling LuongAttention.call().

[1mCould not automatically infer the output shape / dtype of 'luong_attention' (of type LuongAttention). Either the `LuongAttention.call()` method is incorrect, or you need to implement the `LuongAttention.compute_output_spec() / compute_output_shape()` method. Error encountered:

Dimensions must be equal, but are 154 and 151 for '{{node add}} = AddV2[T=DT_HALF](ExpandDims, Placeholder_1)' with input shapes: [?,1,154,256], [?,151,256].[0m

Arguments received by LuongAttention.call():
  • args=('<KerasTensor shape=(None, 154, 256), dtype=float16, sparse=False, name=keras_tensor_35>', '<KerasTensor shape=(None, 151, 256), dtype=float16, sparse=False, name=keras_tensor_30>')
  • kwargs=<class 'inspect._empty'>

In [None]:
lstm_attn_model.fit(
    [train_enc, train_dec_in], train_dec_out,
    validation_data=prepare_datasets(fr_tokenizer, pt_tokenizer, val_dataset, MAX_SAMPLES)[0],
    # validation_split=0.2,
    epochs=10,
    batch_size=64,
    verbose=1
)

# 📊 Visualizando os Resultados

In [None]:
def visualize_results(history):
    plt.figure(figsize=(12, 4))

    # Plot da loss
    plt.subplot(1, 2, 1)
    plt.plot(history.history['loss'], label='Treino')
    plt.plot(history.history['val_loss'], label='Validação')
    plt.title('Loss ao Longo do Treinamento')
    plt.xlabel('Época')
    plt.ylabel('Loss')
    plt.legend()

    # Plot da acurácia
    plt.subplot(1, 2, 2)
    plt.plot(history.history['accuracy'], label='Treino')
    plt.plot(history.history['val_accuracy'], label='Validação')
    plt.title('Acurácia ao Longo do Treinamento')
    plt.xlabel('Época')
    plt.ylabel('Acurácia')
    plt.legend()

    plt.tight_layout()
    plt.show()

In [None]:
def plot_accuracy_by_position(y_true, y_pred):
    """
    Plots the token-level accuracy at each sequence position.

    Parameters:
      - y_true: numpy array of shape (num_samples, sequence_length)
                containing the ground truth token indices.
      - y_pred: numpy array of shape (num_samples, sequence_length)
                containing the predicted token indices.
    """
    seq_length = y_true.shape[1]
    accuracies = []
    for pos in range(seq_length):
        # Compute the fraction of tokens predicted correctly at this position.
        pos_accuracy = np.mean(y_true[:, pos] == y_pred[:, pos])
        accuracies.append(pos_accuracy)

    plt.figure(figsize=(10, 6))
    plt.plot(range(seq_length), accuracies, marker='o')
    plt.xlabel("Token Position")
    plt.ylabel("Accuracy")
    plt.title("Accuracy by Token Position")
    plt.ylim(0, 1)
    plt.grid(True)
    plt.show()


In [None]:
def plot_length_distribution(y_true, y_pred):
    # Calculando comprimentos (ignorando padding)
    true_lengths = [len([x for x in seq if x != 0]) for seq in y_true]
    pred_lengths = [len([x for x in seq if x != 0]) for seq in y_pred]

    plt.figure(figsize=(12, 6))
    plt.hist([true_lengths, pred_lengths], label=['Real', 'Previsto'],
             alpha=0.7, bins=20)
    plt.title('Distribuição do Comprimento das Traduções')
    plt.xlabel('Comprimento da Sequência')
    plt.ylabel('Frequência')
    plt.legend()
    plt.show()

In [None]:
def plot_confusion_heatmap(y_true, y_pred, token_labels):
    """
    Plots a confusion matrix (as a heatmap) for token predictions.

    Parameters:
      - y_true: numpy array of token indices (can be 2D or flattened)
      - y_pred: numpy array of token indices (can be 2D or flattened)
      - token_labels: list of strings that maps each token index to a label.
                      For example: ["PAD", "[START]", "[END]", "bonjour", ...]
    """
    # If the inputs are 2D (num_samples x seq_length), flatten them.
    if y_true.ndim > 1:
        y_true = y_true.flatten()
    if y_pred.ndim > 1:
        y_pred = y_pred.flatten()

    # Compute the confusion matrix.
    labels = np.arange(len(token_labels))
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    plt.figure(figsize=(12, 10))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=token_labels,
                yticklabels=token_labels)
    plt.xlabel("Predicted Token")
    plt.ylabel("True Token")
    plt.title("Confusion Matrix of Token Predictions")
    plt.show()

In [None]:
# For demonstration, assume we have 100 samples, each of length 10 tokens.
num_samples = 100
sequence_length = 10
vocab_size = 50  # For example, assume your vocabulary has 50 tokens (indices 0 to 49)

# Create some random demo ground truth and predicted token sequences.
np.random.seed(42)

# To-do: real data here.
y_true_demo = np.random.randint(0, vocab_size, size=(num_samples, sequence_length))
y_pred_demo = np.random.randint(0, vocab_size, size=(num_samples, sequence_length))

# Create a list of token labels for the confusion heatmap.
# (In your case, you might use tokenizer_pt.index_word but here we simulate labels.)
token_labels = [f"Token {i}" for i in range(vocab_size)]

# Plot accuracy by token position.
plot_accuracy_by_position(y_true_demo, y_pred_demo)

# Plot confusion heatmap.
plot_confusion_heatmap(y_true_demo, y_pred_demo, token_labels)

In [None]:
# 1. Acurácia por posição
val_data = [train_encoder_inputs[-1000:], train_decoder_inputs[-1000:]]  # Usando últimas 1000 amostras como validação
plot_accuracy_by_position(model, val_data[0], val_data[1])

# 2. Distribuição de comprimentos
predictions = model.predict([val_data[0], val_data[1][:, :-1]])
pred_classes = np.argmax(predictions, axis=-1)
plot_length_distribution(val_data[1][:, 1:], pred_classes)

# 3. Heatmap de confusão
plot_confusion_heatmap(val_data[1][:, 1:], pred_classes, tokenizer_pt)

print("\nLegenda das visualizações:")
print("1. Gráfico de Acurácia por Posição: Mostra como o modelo se comporta em diferentes posições da sequência")
print("2. Distribuição de Comprimentos: Compara o tamanho das traduções reais vs. previstas")
print("3. Heatmap: Mostra quais palavras frequentes são mais confundidas entre si")