1. Problem Statement

Language translation remains a complex challenge due to structural differences, contextual variations, and semantic ambiguity across languages. Traditional phrase-based or statistical machine translation models often fail to capture long-range dependencies and contextual meaning. With the rise of multilingual digital communication, there is a strong need for an automated, accurate, and scalable translation system.

This project aims to develop a Transformer-based Neural Machine Translation (NMT) model for bilingual text translation (English → French). The goal is to build an end-to-end pipeline including preprocessing, model training, and evaluation. Using attention-driven sequence-to-sequence learning, the system is expected to generate high-quality, contextually accurate translations. Translation quality is evaluated using the BLEU score to ensure linguistic fidelity and fluency.

2. Objectives

Build a complete NMT pipeline using modern deep learning techniques.

Implement data preprocessing: cleaning, tokenization, sequence formatting.

Train a Transformer-based seq-to-seq model for bilingual translation.

Evaluate model performance using BLEU score.

Demonstrate translation examples using real test sentences.

Save and deploy the trained model for future inference.

3. Project Scope

This project includes:
✔ Dataset loading and preprocessing
✔ Tokenization, attention mask creation, and padding
✔ Transformer-based seq-to-seq model training
✔ Translation inference pipeline
✔ Performance evaluation (BLEU)
✔ Saving the final model

This project does not focus on:
✘ Low-resource language adaptation
✘ Large-scale multi-language training
✘ Model quantization or deployment optimization
✘ Fine-grained linguistic error analysis

4. Dataset Description

Dataset Used: OPUS Books — English-French parallel corpus

Source: HuggingFace Datasets

Contains aligned bilingual sentence pairs

Suited for machine translation tasks

Includes both training and test splits

Text is literary, providing diverse sentence structures

Automatically downloaded in Colab via load_dataset()

In [3]:
# Install Dependencies

!pip install transformers datasets sacrebleu nltk tensorflow

import nltk
nltk.download("punkt")

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.2.0-py3-none-any.whl (22 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-3.2.0 sacrebleu-2.5.1


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# Import Libraries

import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from datasets import load_dataset
from nltk.translate.bleu_score import corpus_bleu
import tensorflow as tf

In [5]:
#  Load Dataset (English-French)

dataset = load_dataset("opus_books", "en-fr")

# Print the dataset object to see available splits
print(dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

en-fr/train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/127085 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 127085
    })
})


In [6]:
# The dataset 'opus_books' only has a 'train' split.
# So to create a validation set, we will split the training data.
train_validation_split = dataset["train"].train_test_split(test_size=0.1, seed=42) # 10% for validation

train_data = train_validation_split["train"]
test_data = train_validation_split["test"] # Renamed to 'test' for consistency with downstream code

print(train_data[0])
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(test_data)}")

{'id': '61445', 'translation': {'en': 'Dom Claude interrupted him,−− "You are happy, then?"', 'fr': 'Dom Claude l’interrompit : « Vous êtes donc heureux ? »'}}
Number of training examples: 114376
Number of validation examples: 12709


In [7]:
from datasets import Dataset

# Take a random sample of 1200 items from the full training dataset
sampled = dataset["train"].shuffle(seed=42).select(range(1200))

# Now split this small sample
split_sample = sampled.train_test_split(test_size=0.2, seed=42)

train_data = split_sample["train"]      # ~960 samples
val_data   = split_sample["test"]       # ~240 samples

print(len(train_data), len(val_data))

960 240


In [8]:
# Load Tokenizer & Model (Transformer NMT)

model_name = "Helsinki-NLP/opus-mt-en-fr"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

source.spm:   0%|          | 0.00/778k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]



tf_model.h5:   0%|          | 0.00/301M [00:00<?, ?B/s]

TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.
All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-fr.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.


generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

In [9]:
#  Preprocess Function

def preprocess(example):
    inputs = tokenizer(
        example["translation"]["en"],
        padding="max_length",
        truncation=True,
        max_length=64
    )
    targets = tokenizer(
        example["translation"]["fr"],
        padding="max_length",
        truncation=True,
        max_length=64
    )
    inputs["labels"] = targets["input_ids"]
    return inputs



In [10]:
# Apply preprocessing

tokenized_train = train_data.map(preprocess, batched=False)
tokenized_test = val_data.map(preprocess, batched=False)

Map:   0%|          | 0/960 [00:00<?, ? examples/s]

Map:   0%|          | 0/240 [00:00<?, ? examples/s]

In [11]:
#  Create Batching & Data Collator

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")

tf_train_set = tokenized_train.to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator
)

tf_test_set = tokenized_test.to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    batch_size=8,
    collate_fn=data_collator
)

In [12]:
# Train the Model (Seq2Seq Transformer)

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
model.compile(optimizer=optimizer)

model.fit(tf_train_set, validation_data=tf_test_set, epochs=1)



<tf_keras.src.callbacks.History at 0x78d6ae11b8f0>

In [13]:
# Evaluation — BLEU Score, BLEU evaluates how close a model-generated sentence is to a human reference, by comparing n-grams (1-gram, 2-gram, 3-gram, etc.).

def translate_sentence(sentence):
    inputs = tokenizer(sentence, return_tensors="tf", padding=True)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

references = []
hypotheses = []

# Explicitly get the first 100 items as a list of dictionaries to ensure correct iteration
val_samples = [val_data[i] for i in range(100)]

for item in val_samples:
    src = item["translation"]["en"]
    tgt = item["translation"]["fr"]

    pred = translate_sentence(src)

    references.append([tgt.split()])
    hypotheses.append(pred.split())

bleu = corpus_bleu(references, hypotheses)
print("BLEU Score:", bleu)

TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.


BLEU Score: 0.14162858035279485


In [14]:
# Demo Translation

text = "Machine learning is transforming the world."
output = translate_sentence(text)
print(f"English: {text}")
print(f"French:  {output}")

English: Machine learning is transforming the world.
French:  L'apprentissage de la machine transforme le monde.


In [15]:
#  Save Model

model.save_pretrained("nmt_transformer_model")
tokenizer.save_pretrained("nmt_transformer_model")
print("Model Saved!")


Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[59513]]}


Model Saved!
