Training an encoder-decoder model for translation using a dataset like Tatoeba, specifically translating from English to Spanish, can be a great way to get hands-on experience with sequence-to-sequence models. Below, I'll outline the steps to set up and train such a model using popular machine learning libraries like PyTorch and Hugging Face's Transformers.

### Step 1: Setup Your Environment

First, ensure you have Python installed and create a virtual environment to manage your dependencies. Then, install the necessary libraries. You likely already have torch and numpy, but it may be necessary to install the Hugging Face libraries:

In [None]:
!pip install transformers datasets sentencepiece accelerate sacremoses

### Step 2: Download and Prepare the Data

You can use the `datasets` library from Hugging Face to load the [Tatoeba dataset](https://tatoeba.org/en/) filtered for English-Spanish pairs:

In [16]:
from datasets import load_dataset

# Load the dataset for English to Spanish (must use "en" and "es" not "eng" and "spa")
dataset = load_dataset("tatoeba", lang1="en", lang2="es", trust_remote_code=True)
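
To make the record layout concrete, here is a small sketch of the batched format that `dataset.map(..., batched=True)` hands to a preprocessing function. It assumes each record carries a nested `translation` dict keyed by language code, which is how the later cells index it. The `batch` literal and the `split_pairs` helper are hypothetical stand-ins, not downloaded data or library functions.

```python
# Hand-written stand-in for a batch of records; with batched=True each column
# arrives as a list, so "translation" is a list of {"en": ..., "es": ...} dicts.
batch = {
    "translation": [
        {"en": "Did you speak with your wife?", "es": "¿Has hablado con tu esposa?"},
        {"en": "You think that it will work?", "es": "¿Piensas que funcionará?"},
    ],
}

def split_pairs(examples, src="en", tgt="es"):
    """Hypothetical helper: pull parallel source/target lists out of a batch."""
    sources = [pair[src] for pair in examples["translation"]]
    targets = [pair[tgt] for pair in examples["translation"]]
    return sources, targets

sources, targets = split_pairs(batch)
print(sources[0], "->", targets[0])
```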

### Step 3: Preprocess the Data

You'll need to tokenize the text data, convert it to tensor format, and create data loaders for training and evaluation. Here’s how you could set up the tokenizer and preprocess the data:

In [17]:
from datasets import DatasetDict  # needed below to combine the splits
from transformers import AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["es"] for ex in examples["translation"]]
    # Tokenize sources and targets in one call; passing text_target stores the
    # target token ids under "labels" (replaces the deprecated as_target_tokenizer)
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=128, truncation=True, padding="max_length"
    )
    return model_inputs

# Apply the preprocessing function
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Split the dataset into training and validation sets
train_test_split = tokenized_dataset["train"].train_test_split(test_size=0.1)  # 10% for validation

# Combine the splits into a single DatasetDict
split_dataset = DatasetDict({
    'train': train_test_split['train'],
    'validation': train_test_split['test']
})
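
One detail the cell above leaves open: because the targets are padded to `max_length`, the padding positions end up in `labels` and would count toward the loss. A common convention in PyTorch-based training is to replace them with -100, the default `ignore_index` of the cross-entropy loss. The sketch below shows the idea on plain lists; `mask_padding` is a hypothetical helper and the pad id of 0 is a made-up value (in practice you would pass `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_padding(label_rows, pad_id):
    """Hypothetical helper: replace padding ids so the loss skips them."""
    return [
        [IGNORE_INDEX if token == pad_id else token for token in row]
        for row in label_rows
    ]

# Toy label ids with a made-up pad id of 0 (use tokenizer.pad_token_id in practice)
labels = [[17, 42, 8, 0, 0], [5, 9, 0, 0, 0]]
masked = mask_padding(labels, pad_id=0)
print(masked)
```

Inside `preprocess_function`, this would be applied to `model_inputs["labels"]` before returning.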

### Step 4: Initialize the Model

You will use a pretrained model that is suitable for translation. For English to Spanish, you can use a model like "Helsinki-NLP/opus-mt-en-es" from Hugging Face's Model Hub:

In [18]:
from transformers import AutoModelForSeq2SeqLM

# Load the pretrained model
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es")

### Step 5: Define Training Arguments and Train the Model

**You can probably skip this step for learning purposes. Most transformer models are large, so even a few epochs of fine-tuning are slow, though still much faster than training from scratch. This cell shows how to fine-tune if you want to use a dataset that differs significantly from common text.**
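
For a rough sense of the cost, the number of optimizer steps follows from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation. The 200,000-pair corpus size is a made-up figure for illustration, not the actual size of the dataset, while the batch size, epoch count, and 10% validation fraction mirror the values used in this notebook.

```python
import math

def total_training_steps(n_examples, batch_size, epochs, val_fraction=0.1):
    """Estimate optimizer steps for one device without gradient accumulation."""
    # Examples left for training after the validation split
    n_train = n_examples - int(n_examples * val_fraction)
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * epochs

# 200_000 is a hypothetical corpus size, chosen only to illustrate the arithmetic
steps = total_training_steps(n_examples=200_000, batch_size=16, epochs=3)
print(steps)
```

Multiplying the result by the measured time per step on your hardware gives an approximate wall-clock estimate.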

You can train your model using the `Trainer` API from Hugging Face. Set up your training arguments and start training:

In [13]:
from transformers import Trainer, TrainingArguments

fine_tune = False

if fine_tune:
    # Define training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",  # newer transformers releases rename this to eval_strategy
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
    )
    
    # Initialize the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=split_dataset["train"],
        eval_dataset=split_dataset["validation"]
    )
    
    # Train the model
    trainer.train()


Let's print a few examples and predictions from the validation set to see how well the model translates.

In [23]:
import random
import torch

def print_translation_examples(n_examples=5):
    
    model.eval() 
    
    # Randomly select examples from the validation set
    examples = random.sample(list(split_dataset['validation']), n_examples)

    for example in examples:
        input_text = example['translation']['en']
        target_text = example['translation']['es']

        # Tokenize the input text as a PyTorch tensor (a single sentence needs no padding)
        inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

        # Generate translation using the model
        with torch.no_grad():
            outputs = model.generate(**inputs)

        # Decode the model output to text
        translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Print the results
        print(f"Input: {input_text}")
        print(f"Target: {target_text}")
        print(f"Predicted Translation: {translated_text}")
        print("-----")

# Call the function
print_translation_examples(n_examples=5)


Input: Tom told me you're only planning on staying here for three days.
Target: Tom me dijo que sólo planeas quedarte aquí tres días.
Predicted Translation: Tom me dijo que sólo planeas quedarte aquí tres días.
-----
Input: Did you speak with your wife?
Target: ¿Has hablado con tu esposa?
Predicted Translation: ¿Hablaste con tu esposa?
-----
Input: You think that it will work?
Target: ¿Piensas que funcionará?
Predicted Translation: ¿Crees que funcionará?
-----
Input: That's not a cat. That's a dog.
Target: No es un gato. Es un perro.
Predicted Translation: Eso no es un gato, es un perro.
-----
Input: I'd like to drink some tea or coffee.
Target: Querría tomar un poco de té o café.
Predicted Translation: Me gustaría tomar un poco de té o café.
-----


### Step 6: Evaluate and Use the Model

After training, you can use the model to translate new sentences and evaluate its performance on a test set:

In [25]:
# Translate a new sentence
inputs = tokenizer("Class is finished for today.", return_tensors="pt")
translated_tokens = model.generate(**inputs)
translated_sentence = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(translated_sentence)

La clase está terminada por hoy.
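
For the evaluation half of this step, a quick sanity check is to compare predictions against references directly. The sketch below computes an exact-match rate and a simple whitespace-token F1 over the five validation examples printed earlier. Both helpers are hypothetical, and these are crude stand-ins rather than standard translation metrics such as BLEU or chrF.

```python
from collections import Counter

def exact_match_rate(predictions, references):
    """Fraction of predictions identical to their reference."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

def token_f1(prediction, reference):
    """F1 over whitespace tokens, counting repeated tokens via multisets."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Targets and predictions copied from the validation examples printed earlier
references = [
    "Tom me dijo que sólo planeas quedarte aquí tres días.",
    "¿Has hablado con tu esposa?",
    "¿Piensas que funcionará?",
    "No es un gato. Es un perro.",
    "Querría tomar un poco de té o café.",
]
predictions = [
    "Tom me dijo que sólo planeas quedarte aquí tres días.",
    "¿Hablaste con tu esposa?",
    "¿Crees que funcionará?",
    "Eso no es un gato, es un perro.",
    "Me gustaría tomar un poco de té o café.",
]

print("Exact match:", exact_match_rate(predictions, references))
mean_f1 = sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(references)
print("Mean token F1:", round(mean_f1, 3))
```

Exact match is harsh on valid paraphrases such as "¿Crees que funcionará?", which is why overlap-based scores are usually reported alongside it.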


This setup provides a complete workflow from data loading and preprocessing to training and using a machine translation model. Adjust the parameters and configurations based on your specific requirements and available computational resources.