# Language translation

In this recipe, we will use a transformer model for language translation: Google's **Text-To-Text Transfer Transformer (T5)**. T5 is an end-to-end model that uses both the encoder and the decoder components of the transformer architecture.

How to do it...

In this recipe, you will initialize a seed sentence in English and translate it to French. T5 expects the input text to state the translation task, as a prefix, along with the seed sentence. The encoder takes the input in the source language and generates a representation of the text. The decoder uses this representation to generate text in the target language. T5 was trained on this task, in addition to many others.

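Import the necessary libraries and select the device used for computation (a GPU if one is available, otherwise the CPU):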

In [3]:
from transformers import (
    T5Tokenizer, T5ForConditionalGeneration
)
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Initialize a tokenizer and a model instance with the t5-base model from Google. We set the tokenizer's model_max_length parameter to 200 and move the model to the selected device:

In [4]:
tokenizer = T5Tokenizer.from_pretrained(
    "t5-base", model_max_length = 200
)
model = T5ForConditionalGeneration.from_pretrained(
    "t5-base", return_dict = True
)
model = model.to(device)

Initialize a seed sequence that you want to translate:

In [5]:
language_sequence = "It's such a beautiful morning today!"

Tokenize the input sequence. The source and target languages are specified as part of the input text itself. This is done by prepending the “translate English to French: ” task prefix to the seed sequence before it is tokenized. We then load the resulting token IDs onto the device that is used for computation, because the model and the token IDs must be on the same device:

In [6]:
input_ids = tokenizer(
    "translate English to French: " + language_sequence,
    return_tensors = "pt",
    truncation = True).input_ids.to(device)
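
The task prefix is ordinary text, so it can be built with a small helper. The following is a minimal sketch and not part of the original recipe: the function name `build_translation_prompt` and the `SUPPORTED_TARGETS` set are hypothetical names introduced here, and the set assumes the English-to-German, English-to-French, and English-to-Romanian directions that the original T5 checkpoints were trained on.

```python
# Hypothetical helper (not part of the recipe): builds the T5 task prefix.
# Assumption: the original T5 checkpoints were trained to translate from
# English to German, French, and Romanian, so other targets are rejected.
SUPPORTED_TARGETS = {"German", "French", "Romanian"}


def build_translation_prompt(text: str, target_lang: str = "French") -> str:
    """Return the input text with the translation task prefix prepended."""
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(f"Unsupported target language: {target_lang!r}")
    return f"translate English to {target_lang}: {text.strip()}"


print(build_translation_prompt("It's such a beautiful morning today!"))
```

The returned string can be passed to the tokenizer in place of the hand-built concatenation used in the cell above.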

Translate the source language token IDs to the target language token IDs via the model. The model uses the encoder-decoder architecture to convert the input token IDs to the output token IDs:

In [7]:
language_ids = model.generate(input_ids,
                             max_new_tokens = 200)
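
To make the role of max_new_tokens concrete, here is a toy sketch of greedy decoding, one of the strategies a generation loop can use. It is an illustration under stated assumptions, not the library's implementation: `greedy_decode` and `toy_next_token` are hypothetical names, the decoder is replaced by a function that replays a fixed list of IDs, and the start ID of 0 and end-of-sequence ID of 1 mirror the values used by the T5 vocabulary.

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[int]], int],
    start_id: int,
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Append one token ID at a time until EOS or the length cap is hit."""
    output = [start_id]
    for _ in range(max_new_tokens):
        # A real decoder would also attend to the encoder's representation
        # of the source sentence; this stand-in only sees the prefix.
        token = next_token(output)
        output.append(token)
        if token == eos_id:
            break
    return output


# Hypothetical stand-in for the decoder: replays a fixed sequence of IDs.
canned = [10, 11, 12, 1]


def toy_next_token(prefix: List[int]) -> int:
    return canned[len(prefix) - 1]


print(greedy_decode(toy_next_token, start_id=0, eos_id=1, max_new_tokens=200))
# → [0, 10, 11, 12, 1]
print(greedy_decode(toy_next_token, start_id=0, eos_id=1, max_new_tokens=2))
# → [0, 10, 11]
```

The second call shows the cap at work: generation stops after two new tokens even though no end-of-sequence token has been produced.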

Decode the output token IDs into text in the target language. We use the tokenizer for this conversion and skip special tokens, such as padding and end-of-sequence markers, so that they do not appear in the result:

In [8]:
language_translation = tokenizer.decode(language_ids[0],
                                        skip_special_tokens = True)

Print the translated output:

In [9]:
print(language_translation)

C'est un beau matin aujourd'hui!
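
The steps of this recipe can be collected into a single function. This is a minimal sketch rather than part of the original recipe: the name `translate` and its parameters are hypothetical, and it assumes that `tokenizer`, `model`, and `device` have been created as in the cells above. It uses only the calls already shown in the recipe.

```python
def translate(text, tokenizer, model, device,
              source_lang="English", target_lang="French",
              max_new_tokens=200):
    """Translate `text` with a T5-style tokenizer and model (sketch)."""
    # State the task in the input text, as T5 expects.
    prompt = f"translate {source_lang} to {target_lang}: {text}"
    # Tokenize and move the token IDs to the same device as the model.
    input_ids = tokenizer(
        prompt, return_tensors="pt", truncation=True
    ).input_ids.to(device)
    # Generate the target-language token IDs.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Convert the first (and only) output sequence back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage, assuming the objects from the recipe are in scope:
# print(translate("It's such a beautiful morning today!",
#                 tokenizer, model, device))
```

Passing the tokenizer, model, and device as arguments keeps the function free of global state, so the same code can be reused with another T5 checkpoint.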
