# En-DE Translation with pretrained T5 base Sequence-to-Sequence Model

Reference: https://github.com/christianversloot/machine-learning-articles/blob/main/easy-machine-translation-with-machine-learning-and-huggingface-transformers.md

In [None]:
from transformers import pipeline

# Init translator
translator = pipeline("translation_en_to_de")

By default, the pipeline uses the t5-base model at revision 686f1db (https://huggingface.co/t5-base).

The model has 223 million parameters. [https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints](https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints)

Every pretrained model requires the tokenizer it was trained with; the pipeline loads the matching tokenizer automatically.

The model is pre-trained on the [Colossal Clean Crawled Corpus](https://www.tensorflow.org/datasets/catalog/c4) (C4), which was developed and released in the context of the same research paper as T5.

The model card is [here](https://huggingface.co/google-t5/t5-base), and the paper is [Raffel et al. 2020](https://jmlr.org/papers/volume21/20-074/20-074.pdf).

## Tokenizer

In [None]:
text = "Hello my comrades! How are you doing today?"
toks = translator.tokenizer(text)
print("toks=", toks)
# Print each token id next to the text piece it decodes to
for token_id in toks["input_ids"]:
    print(token_id, "\t", translator.tokenizer.decode(token_id))

In [None]:
# Mapping from token string to token id
translator.tokenizer.get_vocab()

In [None]:
# Number of entries in the tokenizer vocabulary
len(translator.tokenizer.get_vocab())

In [None]:
# Translate text
#text = "Hello my friends! How are you doing today?"
translation = translator(text)
print(translation)
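What the pipeline does for this task can be approximated with explicit calls. The following is a minimal sketch, not the pipeline's actual implementation: it assumes the task prefix `translate English to German: ` and the generation settings (4 beams, maximum length 300) from the task-specific parameters of the t5-base configuration. The helper names `build_prompt` and `translate_en_to_de` are hypothetical.

```python
T5_PREFIX = "translate English to German: "  # assumed task prefix for En-De


def build_prompt(text, prefix=T5_PREFIX):
    """Prepend the task prefix that tells T5 which task to perform."""
    return prefix + text.strip()


def translate_en_to_de(text, model_name="t5-base"):
    """Translate one English string without the pipeline wrapper (sketch)."""
    # Imported lazily so that defining the helper does not load the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(build_prompt(text), return_tensors="pt")
    # Assumed generation settings: beam search with 4 beams, at most 300 tokens
    output_ids = model.generate(
        **inputs, num_beams=4, max_length=300, early_stopping=True
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Show the string that is actually fed to the tokenizer
print(build_prompt("Hello my comrades! How are you doing today?"))
```

Calling `translate_en_to_de(text)` downloads the checkpoint on first use and should give a result comparable to `translator(text)`.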

In [None]:
text="""
Several European countries hit some of their sustainable energy targets for 2030
a decade early, a study has found, but big gaps remain across the board.

All EU member states made progress in the 2010s toward reaching the UN’s seventh
sustainable development goal, which calls for access to “affordable, reliable,
sustainable and modern energy for all” by 2030. For some indicators, several
countries had already reached the targets by 2021, the study by Polish economists
published on Wednesday found.

The ranking showed the country closest to the overall goal was Sweden, followed
by Denmark, Estonia and Austria. Malta improved the most, with big gains also
found in Cyprus, Latvia and Belgium – though these countries all had a long way
to go. Bulgaria was furthest from the goal.
"""
# The rest of the article sits in an unassigned string literal, so it is not translated.
"""
The study reveals “systematic progress” towards reaching the goal, the researchers
wrote, “with differences between individual EU countries clearly decreasing”.

The economists combined seven metrics to get a single measure of countries’
progress toward the goal. The European Commission has set target values for three
of them, while for the rest, the researchers took the level reached by the top 10% of EU countries in 2015 as a proxy.

Several countries had already achieved their targets for 2030 in at least one
of the indicators by 2021, the research found.

Spain, Malta and Portugal, for instance, hit the target for the average amount
of energy a person consumes in a household. Denmark, Ireland and Luxembourg hit
the target for energy productivity, which compares the size of an economy with
the energy it consumes.
"""
translation = translator(text)
print(translation)
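Because generation is capped at a maximum length, a long article may come back truncated when translated in one call. One possible workaround, sketched here under the assumption that word count is an acceptable proxy for token count, is to group sentences into smaller chunks and translate each chunk separately. The helper `split_into_chunks` and its `max_words` parameter are hypothetical.

```python
import re


def split_into_chunks(text, max_words=60):
    """Group whole sentences into chunks of at most max_words words each.

    A single sentence longer than max_words is kept as its own chunk.
    """
    normalized = " ".join(text.split())  # collapse line breaks and extra spaces
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine."
print(split_into_chunks(sample, max_words=7))
```

With the `translator` from above, the chunks could then be translated one by one, for example with `[translator(chunk)[0]["translation_text"] for chunk in split_into_chunks(text)]`.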

## T5 Model

* vocabulary of 32128 tokens
* maximum length of 300 tokens for the generated translation
* 12 encoder layers and 12 decoder layers
* 12 attention heads per layer
* embeddings of length 768
* regularization with dropout, plus layer normalization
* beam search with 4 beams for generating translations
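The parameter count quoted earlier can be cross-checked from these hyperparameters. The sketch below is a back-of-the-envelope calculation, not code from the library. Beyond the values listed above, it assumes a feed-forward width of 3072, a per-head dimension of 64, 32 relative-position buckets, bias-free linear layers, scale-only layer norms, and an output layer that shares its weights with the input embedding.

```python
def t5_parameter_count(vocab_size=32128, d_model=768, d_ff=3072,
                       num_heads=12, d_kv=64, num_layers=12,
                       num_buckets=32):
    """Estimate the number of weights in a T5-style encoder-decoder."""
    inner = num_heads * d_kv              # total width of the attention heads
    embedding = vocab_size * d_model      # shared input/output embedding
    attention = 4 * d_model * inner       # q, k, v and output projections
    feed_forward = 2 * d_model * d_ff     # two linear layers without bias
    norm = d_model                        # layer norm holds a scale vector only
    # Encoder layer: self-attention + feed-forward, each with a layer norm
    encoder_layer = attention + norm + feed_forward + norm
    # Decoder layer: self-attention + cross-attention + feed-forward
    decoder_layer = 2 * (attention + norm) + feed_forward + norm
    # Relative position bias is stored once per stack; each stack ends in a norm
    rel_bias = num_buckets * num_heads
    encoder = num_layers * encoder_layer + rel_bias + norm
    decoder = num_layers * decoder_layer + rel_bias + norm
    return embedding + encoder + decoder


total = t5_parameter_count()
print(f"{total:,} parameters")  # prints "222,903,552 parameters"
```

Under these assumptions the estimate rounds to the 223 million parameters stated above. With the model loaded, `sum(p.numel() for p in translator.model.parameters())` gives the actual count for comparison.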

In [None]:
translator.model.config

In [None]:
translator.model

### Translation to French
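Load a new model, M2M100 with 418M parameters, which translates directly between many language pairs. The cells below translate a German sentence to Chinese (`zh`) and back; use the language code `fr` for French.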

In [None]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

In [None]:
# translate German

tokenizer.src_lang = "de"
de_text = "Das Leben ist wie eine Tafel Schokolade."
encoded_de = tokenizer(de_text, return_tensors="pt")
encoded_de

In [None]:
generated_tokens = model.generate(**encoded_de, forced_bos_token_id=tokenizer.get_lang_id("zh"))  # target language code, e.g. 'hi' or 'zh'

zh_text=tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
zh_text

The tokenizer vocabulary is shared across languages, so it mixes entries from English, French, and the other supported languages.

In [None]:
# translate the Chinese output back to German
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(zh_text, return_tensors="pt")
encoded_zh

In [None]:
generated_tokens_de = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("de"))  # target language: German

de_text1=tokenizer.batch_decode(generated_tokens_de, skip_special_tokens=True)
de_text1
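To judge how much of the sentence survives the round trip German to Chinese to German, the original and the back-translation can be compared. The sketch below uses character-level similarity from the standard library as a rough indicator; it is not a translation quality metric. The helper `round_trip_similarity` is hypothetical, and the second sentence is a made-up placeholder, not actual model output.

```python
from difflib import SequenceMatcher


def round_trip_similarity(original, round_trip):
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, original, round_trip).ratio()


original = "Das Leben ist wie eine Tafel Schokolade."
# Hypothetical back-translation for illustration only, not actual model output
hypothetical = "Das Leben ist wie ein Stück Schokolade."

print(round(round_trip_similarity(original, hypothetical), 2))
```

In the notebook, the comparison would be `round_trip_similarity(de_text, de_text1[0])`, since `batch_decode` returns a list.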

In [None]:
model.config