# Neural Machine Translation using a Transformer model

Fine-tune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text into French.


## Install required libraries

In [31]:
!pip install datasets
!pip install transformers
!pip install evaluate 
!pip install sacrebleu
!pip install --upgrade --no-cache-dir gdown
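# Colab workaround: report UTF-8 as the preferred encoding so that `!` shell commands
# do not fail with a locale-related NotImplementedError.
import locale
locale.getpreferredencoding = lambda: "UTF-8"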

NotImplementedError: ignored

## Imports

In [2]:
import evaluate
import numpy as np
import pandas as pd
# `load_metric` is omitted: it is unused here and no longer available in recent `datasets` releases.
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import pipeline

## Download the dataset
Download the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset, packaged here as a zip archive of CSV files hosted on Google Drive:

In [3]:
!gdown 1-nLLvtOF_92WxC0-Uo4wJ5gDeQ9moS_2

Downloading...
From: https://drive.google.com/uc?id=1-nLLvtOF_92WxC0-Uo4wJ5gDeQ9moS_2
To: /content/EN-FR.zip
100% 13.3M/13.3M [00:00<00:00, 27.2MB/s]


In [4]:
!unzip /content/EN-FR.zip -d /content/data/

Archive:  /content/EN-FR.zip
  inflating: /content/data/dev.csv   
  inflating: /content/data/test.csv  
  inflating: /content/data/train.csv  


## Read Dataset

In [5]:
dataset = load_dataset("csv", data_files="/content/data/train.csv")

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-41bee49ed35eb023/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-41bee49ed35eb023/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'fr'],
        num_rows: 76251
    })
})

In [9]:
print(type(dataset))
print([f for f in dir(dataset) if not f.startswith('_')])

<class 'datasets.dataset_dict.DatasetDict'>
['align_labels_with_mapping', 'cache_files', 'cast', 'cast_column', 'class_encode_column', 'cleanup_cache_files', 'clear', 'column_names', 'copy', 'data', 'filter', 'flatten', 'formatted_as', 'from_csv', 'from_json', 'from_parquet', 'from_text', 'fromkeys', 'get', 'items', 'keys', 'load_from_disk', 'map', 'num_columns', 'num_rows', 'pop', 'popitem', 'prepare_for_task', 'push_to_hub', 'remove_columns', 'rename_column', 'rename_columns', 'reset_format', 'save_to_disk', 'select_columns', 'set_format', 'set_transform', 'setdefault', 'shape', 'shuffle', 'sort', 'unique', 'update', 'values', 'with_format', 'with_transform']


In [10]:
dataset['train'][8]

{'en': 'The truth was that an idiotic ambition had alone impelled Camille to leave Vernon.',
 'fr': "La vérité était qu'une ambition bête avait seule poussé Camille au départ."}

In [11]:
val_data = pd.read_csv('/content/data/dev.csv')
ds_val = Dataset.from_pandas(val_data)

test_data = pd.read_csv('/content/data/test.csv')
ds_test = Dataset.from_pandas(test_data)


dataset["validation"] = ds_val
dataset["test"] = ds_test

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'fr'],
        num_rows: 76251
    })
    validation: Dataset({
        features: ['en', 'fr'],
        num_rows: 25417
    })
    test: Dataset({
        features: ['en', 'fr'],
        num_rows: 25417
    })
})

### Load the ``T5`` tokenizer to process the English-French language pairs:

In [13]:
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [14]:
tokenizer(['At length, however, the remarks of her companions on her absence of mind aroused her, and she felt the necessity of appearing more like herself.'])

{'input_ids': [[486, 2475, 6, 983, 6, 8, 21029, 13, 160, 9663, 7, 30, 160, 8605, 13, 809, 1584, 32, 10064, 160, 6, 11, 255, 1800, 8, 16696, 13, 16069, 72, 114, 6257, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

The preprocessing function needs to:

1. Prefix the input with a prompt so T5 knows this is a translation task. Models that handle multiple NLP tasks often require such a task-specific prompt.

2. Tokenize the input (English) and the target (French).

3. Truncate sequences so they are no longer than the maximum length set by the `max_length` parameter.

In [15]:
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

def preprocess_function(examples):
    inputs = [prefix + example for example in examples[source_lang]]
    targets = [example for example in examples[target_lang]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
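
To make the three steps concrete without downloading a model, here is a toy sketch in which whitespace splitting stands in for the T5 tokenizer. `toy_tokenize` and `toy_preprocess` are hypothetical names used only for illustration, and the real tokenizer returns subword IDs rather than words.

```python
# Toy illustration of prefixing, tokenizing, and truncating a batch.
# Whitespace splitting stands in for the real T5 subword tokenizer (an assumption).
PREFIX = "translate English to French: "

def toy_tokenize(text, max_length):
    """Split on whitespace and keep at most `max_length` tokens."""
    return text.split()[:max_length]

def toy_preprocess(batch, max_length=8):
    """Mimic the structure returned by the preprocessing function."""
    inputs = [PREFIX + sentence for sentence in batch["en"]]
    return {
        "input_tokens": [toy_tokenize(text, max_length) for text in inputs],
        "label_tokens": [toy_tokenize(text, max_length) for text in batch["fr"]],
    }

batch = {
    "en": ["The truth was that an idiotic ambition had alone impelled Camille to leave Vernon."],
    "fr": ["La vérité était qu'une ambition bête avait seule poussé Camille au départ."],
}
result = toy_preprocess(batch)
print(result["input_tokens"][0])
print(result["label_tokens"][0])
```

Note that the prefix counts toward the length budget: in this toy run, four of the eight input tokens are taken up by the prompt.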

Apply the preprocessing function over the entire dataset:
- Use the ``map`` method.
- Set ``batched=True`` to process multiple elements of the dataset at once.
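
As a rough picture of what batched mapping means, the sketch below passes slices of the columns to a function as a dict of lists. `toy_batched_map` and `add_lengths` are hypothetical helpers and the sentences are made up. The real ``map`` method also keeps the original columns and caches results, which this sketch does not model.

```python
# Toy sketch of batched mapping: the function receives a dict of column lists
# (one slice of the dataset) rather than a single row.
def toy_batched_map(columns, function, batch_size=2):
    """Apply `function` to consecutive slices of `columns` and merge the results."""
    num_rows = len(next(iter(columns.values())))
    output = {}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            output.setdefault(name, []).extend(values)
    return output

def add_lengths(batch):
    # batch["en"] is a list of strings, not one string.
    return {"en_words": [len(text.split()) for text in batch["en"]]}

columns = {
    "en": ["he died", "my name is John", "good morning"],
    "fr": ["il est mort", "je m'appelle John", "bonjour"],
}
mapped = toy_batched_map(columns, add_lengths)
print(mapped)
```

This is why the preprocessing function indexes `examples[source_lang]` and iterates over the result: each call sees many rows at once.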

In [16]:
tokenized_data = dataset.map(preprocess_function, batched=True)

In [17]:
print(tokenized_data['train'][10])

{'en': 'But presently his whole attention was absorbed in twelve or fifteen pretty women who, seated opposite the dock, filled the three galleries above the bench and the jurybox.', 'fr': 'Mais bientôt toute son attention fut absorbée par douze ou quinze jolies femmes qui, placées vis-à-vis la sellette de l’accusé, remplissaient les trois balcons au-dessus des juges et des jurés.', 'input_ids': [13959, 1566, 12, 2379, 10, 299, 3, 25390, 112, 829, 1388, 47, 3, 19402, 16, 13369, 42, 17310, 1134, 887, 113, 6, 3, 22933, 6401, 8, 12908, 6, 3353, 8, 386, 18035, 756, 8, 8453, 11, 8, 12730, 2689, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [3307, 21707, 2633, 520, 1388, 9620, 8074, 721, 260, 103, 10953, 407, 285, 29, 776, 13773, 15, 7, 9382, 285, 6, 6670, 1325, 4642, 18, 85, 18, 3466, 50, 1789, 1954, 20, 3, 40, 22, 6004, 302, 154, 6, 15636, 7, 7, 5635, 110, 5611, 19615, 7, 185, 1

### Dynamically pad the inputs in each batch

In [18]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)
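
Dynamic padding means each batch is padded only to the length of its own longest sequence, rather than to a fixed global length. The sketch below imitates that idea with plain lists. `pad_batch` is a hypothetical helper, not the collator's implementation, and the real collator returns tensors. It assumes T5's pad token ID of 0 and a label padding value of -100, which the loss function ignores.

```python
# Toy sketch of dynamic padding: pad each batch only to its own longest sequence.
PAD_TOKEN_ID = 0      # T5's pad token ID
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss

def pad_batch(features):
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_label - len(f["labels"])))
    return batch

features = [
    {"input_ids": [13959, 1566, 12, 2379, 10, 1], "labels": [3307, 1]},
    {"input_ids": [13959, 1566, 1], "labels": [3307, 21707, 2633, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```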

## Evaluation Metric
Include a metric for evaluating the model's performance. You can quickly load an evaluation method with the [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) metric.
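
BLEU is built from clipped n-gram precisions combined with a brevity penalty. The sketch below computes only the clipped precision for a single n, to give a feel for what the score rewards. It is not SacreBLEU and will not reproduce its numbers. `clipped_precision` is a hypothetical helper, whitespace tokenization is an assumption, and the example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n):
    """Fraction of predicted n-grams that also occur in the reference (counts clipped)."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

prediction = "la vérité était une ambition bête"
reference = "la vérité était qu'une ambition bête"
p1 = clipped_precision(prediction, reference, 1)
p2 = clipped_precision(prediction, reference, 2)
print(p1, p2)
```

A single wrong word lowers the unigram precision slightly but breaks two bigrams, which is why higher-order n-grams make the metric sensitive to word order and fluency.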

In [19]:
metric = evaluate.load("sacrebleu")

In [20]:
# Model might produce some extra spaces or line breaks 
# that are not present in the ground truth labels, and these could affect the evaluation metrics.
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # decode the predicted values into human-readable text
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace all occurrences of -100 in the labels array with the ID of the pad token 
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # decode the labels values into human-readable text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    # round it to four decimal places
    result = {k: round(v, 4) for k, v in result.items()}
    return result
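
The label handling in the metric function can be traced on a tiny example. The vocabulary and token IDs below are made up for illustration, and `toy_decode` is a hypothetical stand-in for `tokenizer.batch_decode`; only the -100 replacement and the list-of-lists reference format mirror the cell above.

```python
# Toy sketch of the label handling in the metric function, with a made-up vocabulary.
PAD_TOKEN_ID = 0
IGNORE_INDEX = -100
toy_vocab = {0: "<pad>", 1: "</s>", 5: "il", 6: "est", 7: "mort", 8: "bonjour"}  # hypothetical IDs

def toy_decode(ids):
    """Decode IDs to text, skipping the pad and end-of-sequence tokens."""
    return " ".join(toy_vocab[i] for i in ids if i not in (0, 1))

labels = [[5, 6, 7, 1], [8, 1, IGNORE_INDEX, IGNORE_INDEX]]

# -100 is not a valid token ID, so map it to the pad ID before decoding.
labels = [[PAD_TOKEN_ID if i == IGNORE_INDEX else i for i in row] for row in labels]
decoded_labels = [toy_decode(row) for row in labels]

# Each prediction may have several references, so references form a list of lists.
references = [[text.strip()] for text in decoded_labels]
print(references)
```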

## Training

In [21]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

At this point, only three steps remain:

1. Define your training hyperparameters in [Seq2SeqTrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments). 

2. Pass the training arguments to [Seq2SeqTrainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.

3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [22]:
training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    save_total_limit=3,
    num_train_epochs=2,
    # reduce memory usage
    fp16=True,

)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

In [23]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | 1.8466 | 1.622129 | 5.5423 |
| 2 | 1.818 | 1.597441 | 5.6991 |


TrainOutput(global_step=19064, training_loss=1.868828809916648, metrics={'train_runtime': 4170.7462, 'train_samples_per_second': 36.565, 'train_steps_per_second': 4.571, 'total_flos': 3148841734569984.0, 'train_loss': 1.868828809916648, 'epoch': 2.0})
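
The `global_step` value in this output can be checked by hand from the dataset size and the training arguments. The short calculation below assumes a single GPU and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

num_train_examples = 76251   # rows in the train split
batch_size = 8               # per_device_train_batch_size
num_epochs = 2               # num_train_epochs
num_devices = 1              # assumption: a single GPU, no gradient accumulation

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The result matches the reported `global_step=19064`, which is consistent with those assumptions.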

In [24]:
trainer.save_model('T5_checkpoint')

In [25]:
!cp -r /content/T5_checkpoint /content/drive/MyDrive/

NotImplementedError: ignored

## Testing

In [32]:
text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."
# The next two inputs omit the task prefix, so the model is not told to translate.
text2 = "my name is John"
text3 = 'he died'

In [33]:
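# Load the fine-tuned checkpoint once instead of on every call.
finetuned_model = AutoModelForSeq2SeqLM.from_pretrained("/content/T5_checkpoint")

def predict(sentence):
    # Tokenize the sentence into a PyTorch tensor of input IDs.
    inputs = tokenizer(sentence, return_tensors="pt").input_ids
    # generate() uses its default length limit here, so long outputs may be cut short.
    outputs = finetuned_model.generate(inputs)
    # Decode the generated IDs back into text.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)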

In [34]:
pred = predict(text)
pred

'Les légumes partagent les ressources avec les bactéries fixatrice'

In [35]:
pred = predict(text2)
pred

'..'

## Lab Task
Clean the text, retrain, and report the performance difference.

In [36]:
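import unicodedata
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_text(text, remove_stopwords=False):
    # Normalization: Unicode NFKC normalization and lowercasing
    text = unicodedata.normalize("NFKC", text).lower()
    # Optionally remove stopwords (the list is English, so use it on the source side only)
    if remove_stopwords:
        text = " ".join(word for word in text.split() if word not in stop_words)
    return text

# Updated preprocessing function that cleans the text before tokenizing
def preprocess_function(examples):
    inputs = [prefix + clean_text(example, remove_stopwords=True) for example in examples[source_lang]]
    targets = [clean_text(example) for example in examples[target_lang]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs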
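
To see what this kind of cleaning does to a single sentence, here is a sketch that uses only the standard library. `toy_clean` is a hypothetical helper, and `TOY_STOPWORDS` is a tiny made-up stand-in for NLTK's English stopword list.

```python
import unicodedata

# Hypothetical stand-in for NLTK's English stopword list.
TOY_STOPWORDS = {"the", "was", "that", "an", "had", "to"}

def toy_clean(text, stopwords=frozenset()):
    """Lowercase, apply Unicode NFKC normalization, and drop any stopwords."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(word for word in text.split() if word not in stopwords)

source = "The truth was that an idiotic ambition had alone impelled Camille to leave Vernon."
lowered = toy_clean(source)
stripped = toy_clean(source, TOY_STOPWORDS)
print(lowered)
print(stripped)
```

The French target still contains the counterparts of the removed function words, so whether dropping them from the source helps translation is an empirical question worth checking against the BLEU score.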

In [37]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()



| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | 1.7394 | 1.552374 | 6.0316 |
| 2 | 1.7453 | 1.538318 | 6.0904 |


TrainOutput(global_step=19064, training_loss=1.7420627283439685, metrics={'train_runtime': 3997.6822, 'train_samples_per_second': 38.148, 'train_steps_per_second': 4.769, 'total_flos': 3148841734569984.0, 'train_loss': 1.7420627283439685, 'epoch': 2.0})

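## Retraining after adding text cleaning gave better performance: validation BLEU rose from 5.6991 to 6.0904.

The intended explanation is that text cleaning lets the model understand the data better: lowercasing merges the capitalized and uncapitalized forms of a word, normalization removes rare and uninformative Unicode characters, and stopword removal drops unnecessary words that contribute little to the semantics. This run does not isolate that effect, however. The trainer cell reuses the existing `tokenized_data` and the already fine-tuned `model`, so the gain may come from the two additional epochs. Re-running `dataset.map` with the updated function and reloading the pretrained checkpoint before training would give a fair comparison.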