<a href="https://colab.research.google.com/github/Deon62/Eng-German-Translator-model/blob/main/ModelFT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes a pretrained T5 model (`t5-small`) to translate English text into German.


In [1]:
!pip install -q sacrebleu evaluate
import torch

print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())


Torch: 2.8.0+cu126
CUDA available: True


In [None]:
import torch

print("CUDA Device Count:", torch.cuda.device_count())
print("Current Device:", torch.cuda.current_device())
print("Device Name:", torch.cuda.get_device_name(0))


CUDA Device Count: 1
Current Device: 0
Device Name: Tesla T4


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [3]:
# Load only 1% of the WMT16 de-en train and validation splits to keep the run small

train_ds = load_dataset("wmt16", "de-en", split="train[:1%]")
val_ds = load_dataset("wmt16", "de-en", split="validation[:1%]")



In [4]:
# Shuffle with a fixed seed so the order is random but reproducible
train_ds = train_ds.shuffle(seed=42)
val_ds = val_ds.shuffle(seed=42)

In [5]:
# Create train/validation split (90% train / 10% valid)
split = train_ds.train_test_split(test_size=0.1, seed=42)
train_raw = split["train"]
val_raw = split["test"]
test_raw = val_ds  # the WMT16 validation slice serves as the held-out test set


In [6]:
print(train_raw[0])

{'translation': {'de': 'Ich würde zur Zulassung einer großen Zahl von Projekten raten, weil klar ist, daß die kulturellen Projekte sehr viele Investitionen Autonomer Gemeinschaften und privater Initiativen anziehen, die uns nicht verloren gehen dürfen.', 'en': 'I would recommend that many projects be allowed since it is clear that cultural projects attract a lot of investment from Autonomous Regions and private initiatives which we cannot afford to lose.'}}


In [7]:
def show_examples(ds, n=3):
    for i in range(n):
        ex = ds[i]["translation"]
        print(f"EN: {ex['en']}\nDE: {ex['de']}\n---")

print("TRAIN SAMPLES:")
show_examples(train_raw)


TRAIN SAMPLES:
EN: I would recommend that many projects be allowed since it is clear that cultural projects attract a lot of investment from Autonomous Regions and private initiatives which we cannot afford to lose.
DE: Ich würde zur Zulassung einer großen Zahl von Projekten raten, weil klar ist, daß die kulturellen Projekte sehr viele Investitionen Autonomer Gemeinschaften und privater Initiativen anziehen, die uns nicht verloren gehen dürfen.
---
EN: We constantly find ourselves having to deal with this contradiction, particularly as regards the rights of the child, an area where there are cases in which more than one country is involved and in which the victims are not only not notified by the state where the judgement is made, but are also deprived of legal support.
DE: Und mit diesem Widerspruch sehen wir uns vor allem in bezug auf die Rechte der Kinder permanent konfrontiert. In diesem Bereich gibt es Fälle, bei denen mehrere Länder beteiligt sind und die Opfer von dem Land, in d

In [8]:
MODEL_CHECKPOINT = "t5-small"
PREFIX = "translate English to German: "

In [9]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

MAX_LEN = 128  # max token length

In [10]:
def preprocess_fn(batch):
    # Extract English + German pairs
    en_texts = [ex["en"] for ex in batch["translation"]]
    de_texts = [ex["de"] for ex in batch["translation"]]

    # Add prefix to English (T5 needs task info)
    model_inputs = tokenizer([PREFIX + text for text in en_texts],
                             max_length=MAX_LEN, truncation=True)

    # Tokenize German target text (text_target replaces the deprecated as_target_tokenizer)
    labels = tokenizer(text_target=de_texts, max_length=MAX_LEN, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
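
With `batched=True`, `map` passes `preprocess_fn` a dictionary of columns rather than a single row, so `batch["translation"]` is a list of `{"de": ..., "en": ...}` dicts. The sketch below mimics that layout with a hand-made batch and shows only the text preparation step, without the tokenizer. The two sentence pairs and the helper `split_pairs` are made up for illustration.

```python
# Hypothetical mini-batch in the columnar layout that map(..., batched=True) passes in.
PREFIX = "translate English to German: "

fake_batch = {
    "translation": [
        {"en": "Good morning.", "de": "Guten Morgen."},
        {"en": "Thank you.", "de": "Danke."},
    ]
}

def split_pairs(batch, prefix=PREFIX):
    """Return (prefixed English sources, German targets) for one batch."""
    sources = [prefix + ex["en"] for ex in batch["translation"]]
    targets = [ex["de"] for ex in batch["translation"]]
    return sources, targets

sources, targets = split_pairs(fake_batch)
for src, tgt in zip(sources, targets):
    print(repr(src), "->", repr(tgt))
```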

In [11]:
# Apply preprocessing to the train, validation and test sets
train_ds = train_raw.map(preprocess_fn, batched=True, remove_columns=train_raw.column_names)
val_ds = val_raw.map(preprocess_fn, batched=True, remove_columns=val_raw.column_names)
test_ds = test_raw.map(preprocess_fn, batched=True, remove_columns=test_raw.column_names)




In [12]:
print(train_ds[0])

{'input_ids': [13959, 1566, 12, 2968, 10, 27, 133, 1568, 24, 186, 1195, 36, 2225, 437, 34, 19, 964, 24, 2779, 1195, 5521, 3, 9, 418, 13, 1729, 45, 2040, 3114, 1162, 6163, 7, 11, 1045, 6985, 84, 62, 1178, 5293, 12, 2615, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [1674, 6368, 881, 1811, 20766, 645, 3, 6403, 10436, 193, 6593, 35, 1080, 29, 6, 5603, 8330, 229, 6, 3, 26, 7118, 67, 3, 25739, 29, 16356, 1319, 2584, 26709, 35, 2040, 3114, 49, 23961, 35, 64, 1045, 52, 12043, 29, 46, 7376, 6, 67, 1149, 311, 20098, 7455, 12443, 5, 1]}


In [13]:
print("Input IDs:", train_ds[0]["input_ids"][:20])
print("Labels:", train_ds[0]["labels"][:20])

# Decode back to human-readable text
print("Decoded Input:", tokenizer.decode(train_ds[0]["input_ids"]))
print("Decoded Label:", tokenizer.decode([tok for tok in train_ds[0]["labels"] if tok != -100]))


Input IDs: [13959, 1566, 12, 2968, 10, 27, 133, 1568, 24, 186, 1195, 36, 2225, 437, 34, 19, 964, 24, 2779, 1195]
Labels: [1674, 6368, 881, 1811, 20766, 645, 3, 6403, 10436, 193, 6593, 35, 1080, 29, 6, 5603, 8330, 229, 6, 3]
Decoded Input: translate English to German: I would recommend that many projects be allowed since it is clear that cultural projects attract a lot of investment from Autonomous Regions and private initiatives which we cannot afford to lose.</s>
Decoded Label: Ich würde zur Zulassung einer großen Zahl von Projekten raten, weil klar ist, daß die kulturellen Projekte sehr viele Investitionen Autonomer Gemeinschaften und privater Initiativen anziehen, die uns nicht verloren gehen dürfen.</s>


In [14]:
# load model
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)


In [15]:
# load data collator (pads each batch dynamically)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
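
`DataCollatorForSeq2Seq` pads each batch to its longest sequence on the fly. Inputs are padded with the tokenizer's pad id (0 for T5) and labels with -100, the index the loss ignores. When a model is passed, as here, the real collator also builds `decoder_input_ids` from the labels. Below is a rough pure-Python sketch of the padding step only, assuming right padding; `pad_batch` is a hypothetical helper, not the library's implementation, and the token ids are toy values.

```python
PAD_ID = 0            # T5 pad token id
LABEL_PAD_ID = -100   # default label padding value, ignored by the loss

def pad_batch(features, pad_id=PAD_ID, label_pad_id=LABEL_PAD_ID):
    """Right-pad input_ids, attention_mask and labels to the batch maximum."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_in)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch

# Toy token ids, made up for illustration
toy = [
    {"input_ids": [13959, 1566, 1], "labels": [1674, 1]},
    {"input_ids": [13959, 1], "labels": [1674, 6368, 881, 1]},
]
padded = pad_batch(toy)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```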

In [16]:
import evaluate
import numpy as np

bleu = evaluate.load("sacrebleu")

In [17]:
def postprocess_text(preds, labels):
    preds = [p.strip() for p in preds]
    labels = [[l.strip()] for l in labels]  # sacrebleu expects list of lists
    return preds, labels

def compute_metrics(eval_pred):
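    preds, labels = eval_pred
    if isinstance(preds, tuple):  # some models return a tuple; the token ids come first
        preds = preds[0]

    # Replace -100 (ignore index / cross-batch padding) with pad_token_id before decoding
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)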

    # Decode
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Postprocess
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute BLEU
    result = bleu.compute(predictions=decoded_preds, references=decoded_labels)

    return {"bleu": result["score"]}


In [18]:
# Collate two examples to check that dynamic padding works
batch = data_collator([train_ds[i] for i in range(2)])
print(batch.keys())
print("Input shape:", batch["input_ids"].shape)

KeysView({'input_ids': tensor([[13959,  1566,    12,  2968,    10,    27,   133,  1568,    24,   186,
          1195,    36,  2225,   437,    34,    19,   964,    24,  2779,  1195,
          5521,     3,     9,   418,    13,  1729,    45,  2040,  3114,  1162,
          6163,     7,    11,  1045,  6985,    84,    62,  1178,  5293,    12,
          2615,     5,     1,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0],
        [13959,  1566,    12,  2968,    10,   101,  4259,   253,  3242,   578,
            12,  1154,    28,    48, 27252,     6,  1989,    38,  9544,     8,
          2166,    13,     8,   861,     6,    46,   616,   213,   132,    33,
          1488,    16,    84,    72,   145,    80,   684,    19,  1381,    11,
            16,    84,     8,  8926,    33,    59,   163,    59,     3, 15195,
            57,     8,   538,   213

In [19]:
# Test compute_metrics function (dummy preds)
sample_preds = tokenizer(["Hallo Welt"], return_tensors="np", padding=True)["input_ids"]
sample_labels = tokenizer(["Hallo Welt"], return_tensors="np", padding=True)["input_ids"]

print(compute_metrics((sample_preds, sample_labels)))

{'bleu': 0.0}
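
A score of 0.0 for an identical prediction and reference looks wrong, but it is a property of BLEU on very short text. BLEU combines 1- to 4-gram precisions with a geometric mean, and a two-word sentence has no 3-grams or 4-grams at all, so the higher orders cannot score. The pure-Python sketch below counts the available n-grams; it assumes whitespace tokenization and is not sacrebleu's implementation.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list (empty if the list is shorter than n)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Hallo Welt".split()
counts = {n: len(ngrams(tokens, n)) for n in range(1, 5)}
print(counts)  # number of n-grams available for each order 1..4
```

A sanity check with identical sentences of at least four tokens should give a far more informative score.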


In [20]:
# Training hyperparameters

BATCH_SIZE = 8
EPOCHS = 3
LR = 5e-5


In [None]:
!pip uninstall -y transformers
!pip install -U transformers accelerate datasets


Found existing installation: transformers 4.55.4
Uninstalling transformers-4.55.4:
  Successfully uninstalled transformers-4.55.4
Collecting transformers
  Using cached transformers-4.55.4-py3-none-any.whl.metadata (41 kB)
Collecting accelerate
  Downloading accelerate-1.10.1-py3-none-any.whl.metadata (19 kB)
Using cached transformers-4.55.4-py3-none-any.whl (11.3 MB)
Downloading accelerate-1.10.1-py3-none-any.whl (374 kB)
Installing collected packages: transformers, accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 1.10.0
    Uninstalling accelerate-1.10.0:
      Successfully uninstalled accelerate-1.10.0
Successfully installed accelerate-1.10.1 transformers-4.55.4


In [21]:
import transformers
print(transformers.__version__)


4.55.4


In [23]:
training_args = Seq2SeqTrainingArguments(
    output_dir="t5-en-de-out",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    logging_dir="./logs",
    predict_with_generate=True,
    fp16=True
)


In [29]:
train_subset = train_ds.select(range(min(len(train_ds), 2000)))
eval_subset = test_ds.select(range(min(len(test_ds), 200)))


In [30]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=eval_subset,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)




In [31]:
trainer.train()




| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | No log | 0.988821 | 19.596361 |
| 2 | 1.289000 | 0.991289 | 19.489041 |
| 3 | 1.289000 | 0.991105 | 19.489041 |


TrainOutput(global_step=750, training_loss=1.2735617268880208, metrics={'train_runtime': 434.0097, 'train_samples_per_second': 13.825, 'train_steps_per_second': 1.728, 'total_flos': 110269732749312.0, 'train_loss': 1.2735617268880208, 'epoch': 3.0})
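
The step count and the "No log" entry follow from the training settings. With 2,000 training examples and a batch size of 8 on a single GPU there are 250 optimizer steps per epoch, so 3 epochs give 750 steps, matching `global_step=750`. The Trainer's default `logging_steps` is 500, so the first training-loss value is recorded at step 500, during epoch 2; that is why epoch 1 shows "No log" and epochs 2 and 3 repeat the same value. A quick check of the arithmetic, assuming no gradient accumulation:

```python
import math

num_examples = 2000   # size of train_subset
batch_size = 8        # BATCH_SIZE
epochs = 3            # EPOCHS
logging_steps = 500   # Trainer default when not set explicitly

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Epoch (1-based) in which the first training-loss log entry is written
first_log_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_log_epoch)
```

Setting a smaller `logging_steps` in `Seq2SeqTrainingArguments` should produce a training-loss value for every epoch.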

In [32]:
sample_texts = [ex["en"] for ex in test_raw["translation"][:5]]
print(sample_texts)


['It is claimed Webster attacked her while she was "unconscious, asleep and incapable of giving consent."', 'Karratha Police have charged a 20-year-old man with failing to stop and reckless driving.', "He is alleged to have raped a woman at the Scotland's Hotel in Pitlochry in Perthshire on June 7, 2013.", 'Congressmen Keith Ellison and John Lewis have proposed legislation to protect union organizing as a civil right.', 'The motorcycle was seized and impounded for three months.']


In [33]:
# Tokenize with prefix (important for T5)
inputs = tokenizer([PREFIX + text for text in sample_texts],
                   return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate translations
outputs = model.generate(**inputs, max_length=MAX_LEN)

# Decode
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for en, de in zip(sample_texts, translations):
    print(f"EN: {en}")
    print(f"DE: {de}")
    print("-" * 40)


EN: It is claimed Webster attacked her while she was "unconscious, asleep and incapable of giving consent."
DE: Es wird behauptet, Webster habe sie angegriffen, als sie "unbewusst, schlafend und unfähig war, ihre Zustimmung zu geben".
----------------------------------------
EN: Karratha Police have charged a 20-year-old man with failing to stop and reckless driving.
DE: Die Polizei von Karratha hat einen 20-jährigen Mann angeklagt, er sei nicht daran gehindert und fahrlässig zu fahren.
----------------------------------------
EN: He is alleged to have raped a woman at the Scotland's Hotel in Pitlochry in Perthshire on June 7, 2013.
DE: Er soll am 7. Juni 2013 eine Frau im Scottish's Hotel in Pitlochry in Perthshire vergewaltigt haben.
----------------------------------------
EN: Congressmen Keith Ellison and John Lewis have proposed legislation to protect union organizing as a civil right.
DE: Die Kongressabgeordneten Keith Ellison und John Lewis haben Gesetze vorgeschlagen, um die Ge

In [34]:
predictions = trainer.predict(eval_subset)


In [35]:
print(predictions)

PredictionOutput(predictions=array([[    0,  1122,   551,    36, 16512,    15,    17,     6,  1620,
         1370,  2010,   680,     3,  3280, 11442,    35,     6,   501,
          680,    96,   202],
       [    0,   316, 16483,   193,  4556,  1795,  1024,     3,   547,
          595,   460,    18, 20025,  6362,     3,  3280,   157,  5430,
           17,     6,     3],
       [    0,   848,  3775,   183,  4306, 12170,  2038,   266,  7672,
          256, 12580,    31,     7,  2282,    16, 13430, 23654,   651,
           16, 22343,  5718],
       [    0,   316,  2974, 10292,     9,   115, 19522,    35, 17017,
        22342,   106,    64,  1079,  9765,   745,   961,  2244,   776,
          426, 24883,     6],
       [    0,   644,  5083,  5672,  1177,  4052, 17605, 12142,    36,
         9444, 18069,    17,    64,    36, 28539,     5,     1,     0,
            0,     0,     0],
       [    0,  3080,  1620,  1370,     6,  2059,  3861,  4445,     6,
            3,   547,   289,     3,  572
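
Each prediction row holds 21 ids: the decoder start token (id 0, which T5 shares with padding) followed by up to 20 generated tokens. Rows that contain the end-of-sequence id 1 finished on their own; rows without it appear to have stopped at a generation length limit, which suggests that some evaluation translations were cut short. The sketch below uses two rows copied from the output above to separate the two cases; `trim_generated` is a hypothetical helper.

```python
PAD_ID = 0  # T5 pad id, also used as the decoder start token
EOS_ID = 1  # T5 end-of-sequence id

def trim_generated(row, pad_id=PAD_ID, eos_id=EOS_ID):
    """Drop the start token, cut at EOS, and report whether EOS was reached."""
    body = list(row[1:]) if row and row[0] == pad_id else list(row)
    if eos_id in body:
        return body[:body.index(eos_id)], True
    return body, False

row_finished = [0, 644, 5083, 5672, 1177, 4052, 17605, 12142, 36, 9444, 18069,
                17, 64, 36, 28539, 5, 1, 0, 0, 0, 0]
row_cut = [0, 1122, 551, 36, 16512, 15, 17, 6, 1620, 1370, 2010, 680, 3, 3280,
           11442, 35, 6, 501, 680, 96, 202]

for row in (row_finished, row_cut):
    tokens, finished = trim_generated(row)
    print(len(tokens), "tokens, reached EOS:", finished)
```

If the length limit is the cause, passing a larger `max_length` to `trainer.predict`, or setting `generation_max_length` in `Seq2SeqTrainingArguments`, should let the model finish its sentences during evaluation.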

In [36]:
model.save_pretrained("t5-en-de-translator")
tokenizer.save_pretrained("t5-en-de-translator")


('t5-en-de-translator/tokenizer_config.json',
 't5-en-de-translator/special_tokens_map.json',
 't5-en-de-translator/spiece.model',
 't5-en-de-translator/added_tokens.json',
 't5-en-de-translator/tokenizer.json')

In [38]:
from huggingface_hub import notebook_login

# Login once
notebook_login()





In [43]:

repo_id = "chinesemusk/t5-en-de-translator"

# Push model + tokenizer
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)


CommitInfo(commit_url='https://huggingface.co/chinesemusk/t5-en-de-translator/commit/4ddba9a6ed3e684bc4be20b226ef52e3f30efb92', commit_message='Upload tokenizer', commit_description='', oid='4ddba9a6ed3e684bc4be20b226ef52e3f30efb92', pr_url=None, repo_url=RepoUrl('https://huggingface.co/chinesemusk/t5-en-de-translator', endpoint='https://huggingface.co', repo_type='model', repo_id='chinesemusk/t5-en-de-translator'), pr_revision=None, pr_num=None)

In [44]:
# use the model from HF

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("chinesemusk/t5-en-de-translator")
tokenizer = AutoTokenizer.from_pretrained("chinesemusk/t5-en-de-translator")

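# The model translates English to German and expects the T5 task prefix used during fine-tuning
text = "translate English to German: This is a test."
inputs = tokenizer(text, return_tensors="pt")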
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Das ist ein Test.
