<a href="https://colab.research.google.com/github/HAL22/EngToXhTranslation/blob/main/Translation(en_xh)_(TensorFlow).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translation (TensorFlow)

Install the Transformers and Datasets libraries to run this notebook.

In [3]:
!pip install datasets transformers[sentencepiece]
!apt install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


You will need to set up git. Replace the email and name in the following cell with your own.

In [4]:
!git config --global user.email "thethelafaltein@gmail.com"
!git config --global user.name "HAL22"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [5]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [6]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("kde4", lang1="en", lang2="xh")

Using custom data configuration en-xh-lang1=en,lang2=xh
Reusing dataset kde4 (/root/.cache/huggingface/datasets/kde4/en-xh-lang1=en,lang2=xh/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac)


  0%|          | 0/1 [00:00<?, ?it/s]

In [7]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 66377
    })
})

In [8]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/kde4/en-xh-lang1=en,lang2=xh/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac/cache-ed504c8507b99077.arrow and /root/.cache/huggingface/datasets/kde4/en-xh-lang1=en,lang2=xh/0.0.0/243129fb2398d5b0b4f7f6831ab27ad84774b7ce374cf10f60f6e1ff331648ac/cache-f7fde84990ca1f98.arrow


DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 59739
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 6638
    })
})

In [9]:
split_datasets["validation"] = split_datasets.pop("test")

In [10]:
split_datasets["validation"][1]["translation"]

{'en': 'Click here to configure the event notification',
 'xh': 'Nqakraza apha xa ufuna ukunqakraza isiganeko solwaziso'}

In [11]:
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-xh"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")



[{'translation_text': 'Okwendalo kwimisonto yokwandisa'}]

In [13]:
split_datasets["train"][172]["translation"]

{'en': 'Edit With...', 'xh': 'Hlela...'}

In [14]:
translator(
    "My email is quite long."
)

[{'translation_text': 'Iposi yam inde kakhulu.'}]

In [12]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-xh"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="tf")



In [15]:
en_sentence = split_datasets["train"][1]["translation"]["en"]
xh_sentence = split_datasets["train"][1]["translation"]["xh"]

inputs = tokenizer(en_sentence)
with tokenizer.as_target_tokenizer():
    targets = tokenizer(xh_sentence)

In [16]:
wrong_targets = tokenizer(xh_sentence)
print(tokenizer.convert_ids_to_tokens(wrong_targets["input_ids"]))
print(tokenizer.convert_ids_to_tokens(targets["input_ids"]))

['▁U', 'my', 'ale', 'zo', '▁Pha', 'w', 'ula', '▁U', 'm', 'son', 'to', '</s>']
['▁Umyalezo', '▁Phawula', '▁Umsonto', '</s>']


In [17]:
max_input_length = 128
max_target_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["xh"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [18]:
print(split_datasets)
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)



DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 59739
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 6638
    })
})


  0%|          | 0/60 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

In [19]:
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)

All PyTorch model weights were used when initializing TFMarianMTModel.

All the weights of TFMarianMTModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


In [20]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [21]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [22]:
batch["labels"]

<tf.Tensor: shape=(2, 4), dtype=int32, numpy=
array([[ 5133,  3668, 41634,     0],
       [ 1571, 29891,     0,  -100]], dtype=int32)>

In [23]:
batch["decoder_input_ids"]

<tf.Tensor: shape=(2, 4), dtype=int32, numpy=
array([[61284,  5133,  3668, 41634],
       [61284,  1571, 29891,     0]], dtype=int32)>
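Comparing `batch["labels"]` with `batch["decoder_input_ids"]` above, the decoder inputs look like the labels shifted one position to the right, with a start token in front. The following is a minimal pure-Python sketch of that relationship, not the collator's actual implementation. The helper name `shift_tokens_right` and the assumption that 61284 serves as both the padding id and the decoder start id are illustrative, read off the printed tensors.

```python
IGNORE_INDEX = -100  # label value the loss skips (seen in batch["labels"] above)


def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted_batch = []
    for row in labels:
        shifted = [decoder_start_token_id] + list(row[:-1])
        # -100 is not a valid token id, so any that remain are swapped for padding.
        shifted_batch.append(
            [pad_token_id if tok == IGNORE_INDEX else tok for tok in shifted]
        )
    return shifted_batch


# Values copied from the batch["labels"] output above.
labels = [[5133, 3668, 41634, 0], [1571, 29891, 0, -100]]

# Assumption: 61284 is both the pad id and the decoder start id for this checkpoint.
decoder_input_ids = shift_tokens_right(
    labels, pad_token_id=61284, decoder_start_token_id=61284
)
print(decoder_input_ids)
```

With these inputs the sketch reproduces the two rows printed for `batch["decoder_input_ids"]`.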

In [24]:
for i in range(1, 3):
    print(tokenized_datasets["train"][i]["labels"])

[5133, 3668, 41634, 0]
[1571, 29891, 0]
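The two raw label lists have different lengths, while `batch["labels"]` is rectangular: the shorter row was padded with -100 so that those positions are ignored by the loss. A small sketch of that padding step, assuming right-padding to the longest row in the batch (`pad_labels` is a hypothetical helper, not a library function):

```python
def pad_labels(label_rows, ignore_index=-100):
    """Right-pad every row with ignore_index up to the longest row in the batch."""
    max_len = max(len(row) for row in label_rows)
    return [list(row) + [ignore_index] * (max_len - len(row)) for row in label_rows]


# The unpadded label lists printed above.
rows = [[5133, 3668, 41634, 0], [1571, 29891, 0]]
padded = pad_labels(rows)
print(padded)

# Positions that would contribute to the loss (True) versus ignored padding (False).
loss_mask = [[tok != -100 for tok in row] for row in padded]
print(loss_mask)
```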


In [25]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)

In [26]:
!pip install sacrebleu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [27]:
from datasets import load_metric

metric = load_metric("sacrebleu")

In [28]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'bp': 0.9200444146293233,
 'counts': [11, 6, 4, 3],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'ref_len': 13,
 'score': 46.750469682990165,
 'sys_len': 12,
 'totals': [12, 11, 10, 9]}

In [27]:
predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'bp': 0.10539922456186433,
 'counts': [1, 0, 0, 0],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'ref_len': 13,
 'score': 1.683602693167689,
 'sys_len': 4,
 'totals': [4, 3, 2, 1]}

In [28]:
predictions = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'bp': 0.004086771438464067,
 'counts': [2, 1, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'ref_len': 13,
 'score': 0.0,
 'sys_len': 2,
 'totals': [2, 1, 0, 0]}
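The `bp` values in the three results above follow the standard BLEU brevity penalty, and the first `score` is that penalty times the geometric mean of the four n-gram precisions. The sketch below recomputes them from the printed `counts`, `totals`, `sys_len` and `ref_len`. It deliberately omits smoothing, so it is not a reimplementation of sacrebleu: it returns 0 whenever an n-gram count is zero, which is why it would not reproduce the smoothed score of the "This This This This" example.

```python
import math


def brevity_penalty(sys_len, ref_len):
    """Penalize hypotheses that are shorter than the reference."""
    if sys_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / sys_len)


def unsmoothed_bleu(counts, totals, sys_len, ref_len):
    """BLEU on a 0-100 scale without smoothing (an illustrative simplification)."""
    if min(counts) == 0 or min(totals) == 0:
        return 0.0
    log_precisions = [math.log(c / t) for c, t in zip(counts, totals)]
    geometric_mean = math.exp(sum(log_precisions) / len(log_precisions))
    return 100 * brevity_penalty(sys_len, ref_len) * geometric_mean


# Numbers copied from the first metric.compute output above.
score = unsmoothed_bleu(
    counts=[11, 6, 4, 3], totals=[12, 11, 10, 9], sys_len=12, ref_len=13
)
print(brevity_penalty(12, 13), score)

# Brevity penalties for the two short predictions (4 and 2 tokens).
print(brevity_penalty(4, 13), brevity_penalty(2, 13))
```

The very small penalties for the 4-token and 2-token predictions show why short outputs score poorly even when every word they contain appears in the reference.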

In [29]:
import numpy as np


def compute_metrics():
    all_preds = []
    all_labels = []
    sampled_dataset = tokenized_datasets["validation"].shuffle().select(range(200))
    tf_generate_dataset = sampled_dataset.to_tf_dataset(
        columns=["input_ids", "attention_mask", "labels"],
        collate_fn=data_collator,
        shuffle=False,
        batch_size=4,
    )
    for batch in tf_generate_dataset:
        predictions = model.generate(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        labels = batch["labels"].numpy()
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        decoded_preds = [pred.strip() for pred in decoded_preds]
        decoded_labels = [[label.strip()] for label in decoded_labels]
        all_preds.extend(decoded_preds)
        all_labels.extend(decoded_labels)

    result = metric.compute(predictions=all_preds, references=all_labels)
    return {"bleu": result["score"]}

In [30]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [34]:
print(compute_metrics())

KeyboardInterrupt: ignored

In [1]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf
# run_functions_eagerly is the non-experimental name that the deprecation notice in this cell's output recommends.
tf.config.run_functions_eagerly(True)


# The number of training steps is the number of batches per epoch multiplied by the total number of epochs.
# Note that tf_train_dataset here is a batched tf.data.Dataset, not the original Hugging Face Dataset,
# so its len() is already the number of batches (num_samples / batch_size, rounded up for a final partial batch).
num_epochs = 1
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")

Instructions for updating:
Use `tf.config.run_functions_eagerly` instead of the experimental version.


NameError: ignored
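The step count passed to `create_optimizer` can be checked by hand. The sketch below uses the 59739 training rows and batch size 32 from earlier cells, and assumes the final partial batch is kept. The result agrees with the 1867 steps per epoch shown in the training progress bar later in the notebook. The learning-rate function is an assumption about the schedule's shape (linear decay from `init_lr` to 0 with no warmup), written out for illustration rather than taken from the library.

```python
import math

num_samples = 59739  # rows in split_datasets["train"]
batch_size = 32      # batch size used for tf_train_dataset
num_epochs = 1

# Number of batches per epoch, counting a final partial batch.
steps_per_epoch = math.ceil(num_samples / batch_size)
num_train_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, num_train_steps)


def linear_decay_lr(step, init_lr=5e-5, total_steps=num_train_steps):
    """Assumed schedule: learning rate falls linearly from init_lr to 0."""
    step = min(step, total_steps)
    return init_lr * (1 - step / total_steps)


# Learning rate at the start, roughly halfway, and at the end of training.
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay_lr(step))
```

Because the schedule depends on `num_train_steps`, changing `num_epochs` or the batch size without recomputing it would make the decay finish too early or too late.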

In [None]:
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir="marian-finetuned-kde4-en-to-xh", tokenizer=tokenizer
)

model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=[callback],
    epochs=num_epochs,
)

/content/marian-finetuned-kde4-en-to-xh is already a clone of https://huggingface.co/Thethela/marian-finetuned-kde4-en-to-xh. Make sure you pull the latest changes with `repo.git_pull()`.


Epoch 1/3
  12/1867 [..............................] - ETA: 10:30:17 - loss: 1.5825

In [None]:
print(compute_metrics())

In [None]:
from transformers import pipeline

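# Replace this with your own checkpoint. The one below is an English-to-French model,
# which is why the outputs shown here are in French rather than Xhosa.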
model_checkpoint = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

[{'translation_text': 'Par défaut, développer les fils de discussion'}]

In [None]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)

[{'translation_text': "Impossible d'importer %1 en utilisant le module externe d'importation OFX. Ce fichier n'est pas le bon format."}]