<a href="https://colab.research.google.com/github/Deyonrose/ADA/blob/main/2348513_LLM_MiniProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project, we aim to create a simple machine translation model using the **MarianMT model**, focusing on translating text from English to German. We use a small portion of the WMT16 dataset for English-German translation.

The **MarianMT model** is a transformer-based neural machine translation model developed by the Microsoft Translator team. It excels in translating between a wide range of language pairs due to its extensive pretraining on large multilingual datasets. MarianMT uses subword tokenization to handle diverse vocabulary and complex linguistic structures, making it highly effective for general translation tasks. Available through the Hugging Face Transformers library, it is a versatile tool for high-quality machine translation applications.

In [1]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m527.3/527.3 kB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2

We start by loading 1% of the WMT16 dataset. The dataset contains parallel sentences in German and English, which we will use to train our translation model.

In [2]:
from datasets import load_dataset

# Load 1% of the WMT16 dataset (English-German) for training
dataset = load_dataset("wmt16", "de-en", split="train[:1%]")

# Shuffle the dataset and select 1000 examples
dataset = dataset.shuffle(seed=42).select(range(1000))

# Display the first few examples
print(dataset[0])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/11.1k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/282M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/267M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/277M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/343k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/475k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4548885 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2169 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2999 [00:00<?, ? examples/s]

{'translation': {'de': 'Wir haben uns bei dieser Abstimmung für eine rasche und konsequente Umsetzung der Rückverfolgbarkeit eingesetzt, damit eine lückenlose Rückverfolgbarkeit sowie klare und transparente Informationen gewährleistet werden, und wir sind mit den Ergebnissen zufrieden.', 'en': 'During this vote, we supported the rapid and rigorous application of traceability, with a view to ensuring unfailing traceability and clear and transparent information, and we are satisfied with the results.'}}


#Data Preprocessing




1.   We remove extra whitespaces, non-alphanumeric characters (except punctuation), and convert the text to lowercase.
2.   We tokenize sentences and words, removing any tokens that are too short (e.g., single characters), which helps clean up the text for better model performance.
3. We use the MarianMT tokenizer to convert the cleaned text into a format that the model can understand (tokenized sequences). Both the input (English) and target (German) sentences are tokenized.



In [3]:
import re
import nltk
from transformers import MarianTokenizer, MarianMTModel

# Download and initialize the NLTK tokenizer
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

# Initialize the MarianMT tokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

def clean_text(text):
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)

    # Remove non-alphanumeric characters (excluding punctuation)
    text = re.sub(r"[^\w\s\.,!?\'\"-]", '', text)

    # Lowercase the text
    text = text.lower()

    # Tokenize sentences
    sentences = sent_tokenize(text)

    # Tokenize words and remove short tokens (e.g., single characters)
    sentences = [word_tokenize(sentence) for sentence in sentences]
    sentences = [' '.join([word for word in sentence if len(word) > 1]) for sentence in sentences]

    # Recombine sentences
    text = ' '.join(sentences)

    return text

def preprocess_function(examples):
    # Extract and clean texts
    source_texts = [clean_text(item['en']) for item in examples['translation']]
    target_texts = [clean_text(item['de']) for item in examples['translation']]

    # Tokenize the inputs and targets
    inputs = tokenizer(source_texts, max_length=128, truncation=True, padding="max_length")
    targets = tokenizer(target_texts, max_length=128, truncation=True, padding="max_length")

    # Return tokenized data
    return {'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask'], 'labels': targets['input_ids']}

# Apply preprocessing function to the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Display the first processed example
print("Processed example:", tokenized_datasets[0])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/768k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/797k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.27M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.33k [00:00<?, ?B/s]



Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Processed example: {'translation': {'de': 'Wir haben uns bei dieser Abstimmung für eine rasche und konsequente Umsetzung der Rückverfolgbarkeit eingesetzt, damit eine lückenlose Rückverfolgbarkeit sowie klare und transparente Informationen gewährleistet werden, und wir sind mit den Ergebnissen zufrieden.', 'en': 'During this vote, we supported the rapid and rigorous application of traceability, with a view to ensuring unfailing traceability and clear and transparent information, and we are satisfied with the results.'}, 'input_ids': [546, 60, 3793, 95, 4093, 4, 8470, 8, 33657, 786, 7, 41178, 33, 704, 12, 6765, 318, 2481, 22085, 41178, 8, 1323, 8, 5977, 229, 8, 95, 48, 10631, 33, 4, 924, 0, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 58100, 

We initialize the MarianMT model and configure the training settings. The model is trained for three epochs, and we use a learning rate of 2e-5 with a batch size of 8.

In [4]:
from transformers import MarianMTModel, Seq2SeqTrainer, Seq2SeqTrainingArguments

# Initialize the MarianMT model
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
)

pytorch_model.bin:   0%|          | 0.00/298M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]



After training the model, we test it on new German sentences. The model generates English translations, which we compare with our expected results to evaluate the model's performance.

In [5]:
from transformers import DataCollatorForSeq2Seq

# Initialize data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,                  # Training arguments
    data_collator=data_collator,         # Data collator
    train_dataset=tokenized_datasets,    # Training dataset
    eval_dataset=tokenized_datasets      # Evaluation dataset
)


# Train the model
trainer.train()


  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Step,Training Loss


Non-default generation parameters: {'max_length': 512, 'num_beams': 4, 'bad_words_ids': [[58100]], 'forced_eos_token_id': 0}


TrainOutput(global_step=375, training_loss=1.7021329752604166, metrics={'train_runtime': 49.1452, 'train_samples_per_second': 61.044, 'train_steps_per_second': 7.63, 'total_flos': 101695094784000.0, 'train_loss': 1.7021329752604166, 'epoch': 3.0})

Once trained, the model can generate translations for new English sentences. We simply provide an English sentence, and the model outputs the corresponding German translation.

The model successfully translates the given English sentences into German.

In [7]:
import torch

# Ensure the model is on the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def generate_translation(text):
    # Prefix is usually not needed for MarianMT models, so we'll use plain text
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)

    with torch.no_grad():  # Disable gradient calculation
        # Generate translations
        translated_ids = model.generate(input_ids, max_length=512)

    # Decode the generated text
    return tokenizer.decode(translated_ids[0], skip_special_tokens=True)

# Sample sentences for testing the model
sample_sentences = ["My name is Deyon.", "This is my llm mini project.", "This is english to german translator."]
for sentence in sample_sentences:
    print(f"English: {sentence}")
    print(f"German: {generate_translation(sentence)}\n")


English: My name is Deyon.
German: Ich heisse Deyon.

English: This is my llm mini project.
German: Das ist mein llm-Miniprojekt.

English: This is english to german translator.
German: Das ist ein english angermanischer translator.

