<a href="https://colab.research.google.com/github/Rokoson/FacebookIV/blob/master/english_yoruba_machine_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## English to Yoruba machine translation: Fine-tuning a pretrained Hugging Face Transformer model

### Step 1: Install the transformers libraries

In [1]:
!pip install datasets transformers[sentencepiece]

Collecting datasets
  Downloading datasets-1.18.3-py3-none-any.whl (311 kB)

### Step 2: Load a pretrained model
#### The pretrained model is loaded from the Hugging Face hub: https://huggingface.co/omoekan/opus-tatoeba-eng-yor





In [5]:
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM
model_name = 'omoekan/opus-tatoeba-eng-yor'

# from_pt=True converts the PyTorch checkpoint to a TensorFlow model
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMarianMTModel: ['lm_head.weight']
- This IS expected if you are initializing TFMarianMTModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMarianMTModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFMarianMTModel were not initialized from the PyTorch model and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [6]:
from transformers import pipeline
translator = pipeline('translation', model=model, tokenizer=tokenizer)

In [7]:
translator('thank you')

[{'translation_text': 'dúpé lówó rẹ'}]
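
The pipeline returns a list with one dictionary per input sentence, as the output above shows. A minimal sketch of pulling out the translated string follows; it pastes the recorded output in as a literal so that it runs without downloading the model, and the variable names are illustrative only.

```python
# Recorded output of translator('thank you') from the cell above,
# pasted as a literal so this snippet runs without the model.
result = [{'translation_text': 'dúpé lówó rẹ'}]

# One dict per input sentence; the text sits under 'translation_text'.
translations = [item['translation_text'] for item in result]
print(translations[0])
```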

### Step 3: Get data to fine-tune the model on
#### We will use the menyo20k dataset, which is also available on the Hugging Face hub (https://huggingface.co/datasets/menyo20k_mt). It has about 10K pairs of English and Yoruba sentences/phrases.

In [8]:
from datasets import load_dataset

dataset = load_dataset("menyo20k_mt")


Downloading and preparing dataset menyo20k_mt/menyo20k_mt (download: 2.38 MiB, generated: 2.43 MiB, post-processed: Unknown size, total: 4.81 MiB) to /root/.cache/huggingface/datasets/menyo20k_mt/menyo20k_mt/1.0.0/96c9c82d2a5afc5726b868d436c0b8ae3eb7cbeea393e76b70cb3ded479d0376...



Dataset menyo20k_mt downloaded and prepared to /root/.cache/huggingface/datasets/menyo20k_mt/menyo20k_mt/1.0.0/96c9c82d2a5afc5726b868d436c0b8ae3eb7cbeea393e76b70cb3ded479d0376. Subsequent calls will reuse this data.



In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 10070
    })
})

In [10]:
dataset["train"][:5]

{'translation': [{'en': 'Unit 1: What is Creative Commons?',
   'yo': '\ufeffÌdá 1: Kín ni Creative Commons?'},
  {'en': 'This work is licensed under a Creative Commons Attribution 4.0 International License.',
   'yo': 'Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commons Attribution 4.0 International License.'},
  {'en': 'Creative Commons is a set of legal tools, a nonprofit organization, as well as a global network and a movement — all inspired by people’s willingness to share their creativity and knowledge, and enabled by a set of open copyright licenses.',
   'yo': 'Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan ohun-èlò ajẹmófin, iléeṣẹ́ àìlérèlórí, àti àjọ àwọn ènìyàn eléròǹgbà kan náà kárí àgbáńlá ayé— tí í ṣe ìmísí àwọn ènìyànkan tí ó ní ìfẹ́ tinútinú láti pín àwọn iṣẹ́-àtinúdá àti ìmọ̀ wọn èyí tí ó ní àtìlẹ́yìn àwọn ọ̀kan-ò-jọ̀kan àṣẹ ìṣísílẹ̀-gbangba-wálíà fún àtúnlò.'},
  {'en': 'Creative Commons began in response to an outdated global copyright legal system.',
   'yo': 'Creative Commons bẹ̀rẹ̀

In [11]:
dataset["train"]["translation"][:5]

[{'en': 'Unit 1: What is Creative Commons?',
  'yo': '\ufeffÌdá 1: Kín ni Creative Commons?'},
 {'en': 'This work is licensed under a Creative Commons Attribution 4.0 International License.',
  'yo': 'Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commons Attribution 4.0 International License.'},
 {'en': 'Creative Commons is a set of legal tools, a nonprofit organization, as well as a global network and a movement — all inspired by people’s willingness to share their creativity and knowledge, and enabled by a set of open copyright licenses.',
  'yo': 'Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan ohun-èlò ajẹmófin, iléeṣẹ́ àìlérèlórí, àti àjọ àwọn ènìyàn eléròǹgbà kan náà kárí àgbáńlá ayé— tí í ṣe ìmísí àwọn ènìyànkan tí ó ní ìfẹ́ tinútinú láti pín àwọn iṣẹ́-àtinúdá àti ìmọ̀ wọn èyí tí ó ní àtìlẹ́yìn àwọn ọ̀kan-ò-jọ̀kan àṣẹ ìṣísílẹ̀-gbangba-wálíà fún àtúnlò.'},
 {'en': 'Creative Commons began in response to an outdated global copyright legal system.',
  'yo': 'Creative Commons bẹ̀rẹ̀ láti wá wọ̀rọ̀kọ̀ fi ṣ
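
Each record is a dictionary with an `en` and a `yo` key, and the first Yoruba sentence above starts with `\ufeff`, a byte-order mark carried over from the source file. The notebook does not clean this up; the sketch below only shows one way it could be done. `clean_pair` is a hypothetical helper, and the two records are copied from the output above.

```python
# Two records copied from the output above.
records = [
    {'en': 'Unit 1: What is Creative Commons?',
     'yo': '\ufeffÌdá 1: Kín ni Creative Commons?'},
    {'en': 'This work is licensed under a Creative Commons Attribution 4.0 International License.',
     'yo': 'Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commons Attribution 4.0 International License.'},
]

BOM = '\ufeff'  # byte-order mark; str.strip() alone does not treat it as whitespace

def clean_pair(record):
    """Hypothetical helper: drop a leading BOM, then surrounding whitespace."""
    return {lang: text.lstrip(BOM).strip() for lang, text in record.items()}

cleaned = [clean_pair(record) for record in records]
print(repr(cleaned[0]['yo']))
```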

In [12]:
# Split the dataset into train and test sets, then rename the test set 'validation'.
# Because this is a small dataset, we set aside only 5% for validation.

split_datasets = dataset["train"].train_test_split(train_size=0.95, seed=20)
split_datasets["validation"] = split_datasets.pop("test")
split_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 9566
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 504
    })
})
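
The split sizes above follow from the arithmetic: 95% of 10070 rows is 9566.5, which rounds down to 9566 training rows and leaves 504 for validation. The sketch below reproduces only those sizes with a seeded shuffle from the standard library. `toy_train_validation_split` is a hypothetical function, not the `datasets` implementation, so the specific rows it selects will differ from the real split.

```python
import math
import random

def toy_train_validation_split(n_rows, train_fraction, seed):
    """Shuffle row indices with a fixed seed, then cut at floor(train_fraction * n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same order every run
    n_train = math.floor(train_fraction * n_rows)
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = toy_train_validation_split(10070, 0.95, seed=20)
print(len(train_idx), len(val_idx))  # 9566 504
```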

### Step 4: Preprocess dataset
#### We tokenize the datasets with the model's tokenizer, making sure to select en as the source language and yo as the target

In [13]:
max_input_length = 128
max_target_length = 128



def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["yo"] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
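
To make the shape of the result concrete, here is a toy version of the same preprocessing that runs without the real tokenizer. It is only a sketch: `toy_encode` splits on whitespace instead of using the model's SentencePiece vocabulary, `EOS_ID` and the small `max_length` of 4 are assumptions chosen to make truncation visible, and the sample pair is taken from the dataset output shown earlier.

```python
EOS_ID = 0  # assumed end-of-sequence id for this toy example

def toy_encode(text, vocab, max_length):
    """Toy whitespace tokenizer: words -> ids, truncated so the result fits max_length."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    return ids[: max_length - 1] + [EOS_ID]  # keep room for the EOS id

def toy_preprocess(examples, max_input_length=4, max_target_length=4):
    vocab = {}  # one shared toy vocabulary, built on the fly
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["yo"] for ex in examples["translation"]]
    input_ids = [toy_encode(text, vocab, max_input_length) for text in inputs]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
        "labels": [toy_encode(text, vocab, max_target_length) for text in targets],
    }

batch = {"translation": [
    {"en": "Unit 1: What is Creative Commons?",
     "yo": "Ìdá 1: Kín ni Creative Commons?"},
]}
features = toy_preprocess(batch)
print(features)
```

As in the real function, the source ids become `input_ids` and the target ids become `labels`; both six-word sentences are cut to three tokens plus the end-of-sequence id.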

In [14]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)


#### Since we are using TensorFlow, we convert the datasets to tf.data datasets and specify a data collator appropriate for a seq2seq model

In [15]:
# Data collator
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
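
The collator pads each batch to the length of its longest sequence, and pads `labels` with -100 so that padded positions are ignored by the loss. The following is a rough pure-Python sketch of that idea, not the library code: `toy_collate` and `PAD_ID` are hypothetical, and the real collator also returns tensors and can prepare decoder inputs when a model is passed.

```python
PAD_ID = 99          # assumed pad token id for this toy example
LABEL_PAD_ID = -100  # label value that the loss skips

def toy_collate(features):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_in = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * pad_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * pad_in)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - len(f["labels"])))
    return batch

# Made-up token ids of unequal length.
examples = [
    {"input_ids": [1, 2, 3, 0], "attention_mask": [1, 1, 1, 1], "labels": [7, 2, 8, 0]},
    {"input_ids": [4, 0], "attention_mask": [1, 1], "labels": [5, 6, 9, 9, 0]},
]
padded = toy_collate(examples)
print(padded)
```

Padding per batch rather than to a fixed 128 tokens keeps batches of short sentences small, which is why the tokenization step above truncates but does not pad.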

In [16]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=16,
)

### Step 5: Fine-tune the model with Keras

In [17]:
from transformers import create_optimizer
import tensorflow as tf

num_epochs = 3
num_train_steps = len(tf_train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=5e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.
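
With `num_warmup_steps=0` and the other settings left at their defaults, the schedule built by `create_optimizer` should amount to a linear decay from `init_lr` down to zero over `num_train_steps`. The sketch below writes that curve out in plain Python under that assumption. `linear_decay_lr` is a hypothetical function, and 300 steps per epoch is an illustrative figure, not the value of `len(tf_train_dataset)`.

```python
def linear_decay_lr(step, init_lr, num_train_steps, num_warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)

steps_per_epoch = 300  # illustrative assumption, not measured from the dataset
num_epochs = 3
total_steps = steps_per_epoch * num_epochs

for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay_lr(step, 5e-5, total_steps))
```

Because the decay length is tied to `num_train_steps`, changing the number of epochs or the batch size changes how quickly the learning rate falls.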


##### Let's evaluate the model on the validation set before training

In [18]:
model.evaluate(tf_eval_dataset) # ignore the warnings

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported


3.112367630004883
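
One way to read this number, assuming it is the mean per-token cross-entropy in nats, is to convert it to perplexity by exponentiating. The snippet below does that for the value reported above; the interpretation depends on that assumption about how the model's internal loss is averaged.

```python
import math

eval_loss = 3.112367630004883  # validation loss reported above, before fine-tuning

# Perplexity is exp(loss) when the loss is an average cross-entropy in nats.
perplexity = math.exp(eval_loss)
print(f"{perplexity:.1f}")
```

Running the same conversion on the validation loss after training gives a like-for-like comparison of the pretrained and fine-tuned models.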

In [19]:
model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    epochs=num_epochs,
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f34462789d0>

In [20]:
# you can test your model using the pipeline function
from transformers import pipeline

translator = pipeline("translation", model=model, tokenizer=tokenizer)

In [21]:
translator('Thank you') # better than pretrained!

[{'translation_text': 'Ẹ ṣeun'}]

### Step 6: Save your model

In [None]:
model.save_pretrained('your file path')
tokenizer.save_pretrained('your file path')