### English to Spanish Translation Using Hugging Face

In this tutorial, you'll learn how to use the Hugging Face `transformers` and `datasets` libraries to translate English into Spanish. We'll use the Tatoeba dataset for example sentences and Helsinki-NLP's pretrained `opus-mt-en-es` model to produce the translations.

#### Installs

In [5]:
# sentencepiece is needed by the Marian tokenizer that the Helsinki-NLP model uses
!pip install transformers datasets sentencepiece

#### Step 1: Download the Tatoeba Dataset
First, we need to download the Tatoeba dataset, which contains pairs of sentences in various languages, including English and Spanish.

In [6]:
from datasets import load_dataset

# Load the Tatoeba dataset
dataset = load_dataset("tatoeba", lang1="en", lang2="es", trust_remote_code=True)
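
To make the record layout concrete, here is a minimal sketch that uses a hand-built stand-in for two rows of the training split, so it runs without any download. It assumes each row carries a `translation` dict keyed by language code, which is how Step 3 reads the data; `sample_rows` and `to_pairs` are hypothetical names introduced only for this illustration.

```python
# Hand-built stand-in for two rows of dataset["train"].
# The real rows come from load_dataset; these sentences are copied from this tutorial.
sample_rows = [
    {"translation": {"en": "That won't happen.", "es": "Eso no acontecerá."}},
    {"translation": {"en": "This is never going to end.", "es": "Esto no terminará jamás."}},
]


def to_pairs(rows, src="en", tgt="es"):
    """Return (source, target) sentence tuples from translation records."""
    return [(row["translation"][src], row["translation"][tgt]) for row in rows]


pairs = to_pairs(sample_rows)
for en, es in pairs:
    print(f"{en} -> {es}")
```

The same helper should work on real rows, for example `to_pairs(dataset["train"].select(range(5)))`, provided the layout assumption holds.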

#### Step 2: Load the Pretrained Translation Model
We will use the Helsinki-NLP model trained for English to Spanish translation. This model is available on the Hugging Face model hub.

In [7]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
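
The model ID above follows the `Helsinki-NLP/opus-mt-{src}-{tgt}` naming pattern that Helsinki-NLP uses for its OPUS-MT checkpoints. The sketch below builds such an ID from two language codes; `opus_mt_model_name` is a hypothetical helper, and not every language pair has a published checkpoint, so check the model hub before relying on a generated name.

```python
def opus_mt_model_name(src: str, tgt: str) -> str:
    """Build a model ID following the Helsinki-NLP/opus-mt-{src}-{tgt} pattern."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


print(opus_mt_model_name("en", "es"))  # Helsinki-NLP/opus-mt-en-es
```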

#### Step 3: Translate Sentences from the Dataset

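To demonstrate how the pretrained model works, we'll translate a few examples from the Tatoeba training set. These sentences differ somewhat from the data the model was originally trained on, so expect minor differences between the predicted and target translations. In many cases both are valid ways of saying the same thing.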


In [21]:
def translate(text):
    # Encode the text
    encoded_text = tokenizer(text, return_tensors="pt", padding=True)

    # Generate translation
    translated_tokens = model.generate(**encoded_text)

    # Decode and return the translation
    translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
    return translated_text

# Select a few sentences from the dataset (you can change these)
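ids = [10, 20, 30, 40, 50]
# Index the row first so only the selected records are read,
# instead of materializing the whole "translation" column
examples = [dataset["train"][i]["translation"] for i in ids]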

for example in examples:
    print("Input English:", example["en"])
    print("Target Spanish:", example["es"])
    translated_text = translate(example["en"])
    print("Predicted Spanish:", translated_text)
    print()

Input English: Muiriel is 20 now.
Target Spanish: Ahora, Muiriel tiene 20 años.
Predicted Spanish: Muiriel ahora tiene 20 años.

Input English: This is never going to end.
Target Spanish: Esto no terminará jamás.
Predicted Spanish: Esto nunca va a terminar.

Input English: You're in better shape than I am.
Target Spanish: Estás en mejor forma que yo.
Predicted Spanish: Estás en mejor forma que yo.

Input English: That won't happen.
Target Spanish: Eso no acontecerá.
Predicted Spanish: Eso no sucederá.

Input English: I'll call them tomorrow when I come back.
Target Spanish: Les llamaré mañana cuando regrese.
Predicted Spanish: Los llamaré mañana cuando vuelva.
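
To put a rough number on how close the predictions are to the targets, the following sketch compares the five pairs printed above using Python's standard `difflib`. This is only a character-level string similarity, not a translation-quality metric (scores such as BLEU or chrF are the usual choice for that); `results` and `similarity` are names introduced here for illustration.

```python
from difflib import SequenceMatcher

# (target, predicted) pairs copied from the output above
results = [
    ("Ahora, Muiriel tiene 20 años.", "Muiriel ahora tiene 20 años."),
    ("Esto no terminará jamás.", "Esto nunca va a terminar."),
    ("Estás en mejor forma que yo.", "Estás en mejor forma que yo."),
    ("Eso no acontecerá.", "Eso no sucederá."),
    ("Les llamaré mañana cuando regrese.", "Los llamaré mañana cuando vuelva."),
]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


for target, predicted in results:
    print(f"{similarity(target, predicted):.2f}  {predicted}")

# Count predictions that reproduce the target exactly
exact_matches = sum(target == predicted for target, predicted in results)
print(f"Exact matches: {exact_matches}/{len(results)}")  # Exact matches: 1/5
```

Only one prediction matches its target exactly, yet the remaining four differ mostly in word order or synonym choice, which is why exact match alone understates the model's quality.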



#### Step 4: Translate a New Sentence
Finally, we'll use the model to translate a new sentence that isn't from the dataset.

In [22]:
new_sentence = "Hello, how are you today?"
translation = translate(new_sentence)
print("New Sentence:", new_sentence)
print("Translation:", translation)

New Sentence: Hello, how are you today?
Translation: Hola, ¿cómo estás hoy?
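
The `translate` function handles one string at a time. When you have several new sentences, it is usually more efficient to send them through the model together. The sketch below is one possible batched variant, not part of the original notebook: `translate_batch` is a hypothetical helper that takes the tokenizer and model from Step 2 as arguments, pads the inputs to a common length, and decodes every output with `batch_decode`.

```python
def translate_batch(texts, tokenizer, model):
    """Translate a list of strings with a single generate() call."""
    # Pad so that sentences of different lengths fit into one tensor batch
    encoded = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

    # Generate translations for the whole batch at once
    generated = model.generate(**encoded)

    # batch_decode returns one decoded string per input sentence
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Usage with the objects loaded in Step 2 (requires the model download):
# translate_batch(["Good morning.", "See you tomorrow."], tokenizer, model)
```

Passing the tokenizer and model explicitly, rather than reading them from globals, keeps the helper reusable with other language pairs.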
