# **Project Title: Multilingual Paraphrase Generation and Translation**


## **Problem Statement:**
In many multilingual environments, such as global content localization or customer support systems, it is essential to generate paraphrases in one language and then translate them to a different language. This helps in content adaptation, ensuring that the meaning is preserved across languages. The task is to develop a system that:

* Takes a sentence in one language (e.g., English).
* Generates a paraphrase in the same language.
* Translates the paraphrased sentence into a target language (e.g., French).

We'll use a pretrained multilingual model, mBART, to handle the paraphrasing and translation tasks.

# **Steps to Fine-Tune mBART on a Paraphrase Dataset:**

* **Install Dependencies:** Install necessary libraries.
* **Load mBART Model and Tokenizer:** The mBART model (facebook/mbart-large-50-many-to-many-mmt) and its corresponding tokenizer are loaded to handle multiple language translation tasks.

* **Apply LoRA (Low-Rank Adaptation):** LoRA is applied to the attention projection submodules of the model (k_proj, v_proj, q_proj, out_proj). The pretrained weights stay frozen and only the small low-rank matrices added to these projections are trained, which greatly reduces the number of trainable parameters.

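* **Load the Dataset:** The GLUE dataset (MRPC task) is loaded for fine-tuning. MRPC is a sentence-pair classification dataset in which each pair is labeled as a paraphrase (label 1) or not (label 0); here sentence1 serves as the input and sentence2 as the target.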

* **Preprocess the Dataset:** The dataset is tokenized for both sentences in each pair using the mBART tokenizer. The tokenized input and target sequences are aligned as inputs and labels for training.

* **Define Training Arguments:** The TrainingArguments specify settings for the training process, including batch size, learning rate, number of epochs, and mixed-precision training.

* **Train the Model Using Trainer API:** The Trainer is set up with the model, training arguments, and tokenized dataset for fine-tuning.

* **Fine-tune the Model:** The train() method of the Trainer is called to start the fine-tuning process on the MRPC task with LoRA applied.
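The parameter savings mentioned in the LoRA step can be made concrete with a little arithmetic. The sketch below assumes each targeted attention projection in mBART-large is a 1024 x 1024 weight matrix (the hidden size is an assumption, not stated above) and ignores biases; `lora_param_counts` is a hypothetical helper, not part of peft.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) parameter counts for one weight matrix.

    full: parameters updated when the whole d_in x d_out matrix is trained.
    lora: parameters in the two low-rank factors (d_in x r) and (r x d_out).
    """
    full = d_in * d_out
    lora = r * (d_in + d_out)
    return full, lora


# Assumed hidden size of 1024 and the rank r=8 used in the LoraConfig.
full, lora = lora_param_counts(1024, 1024, 8)
print(f"full fine-tuning: {full} parameters per projection")
print(f"LoRA (r=8):       {lora} parameters per projection")
print(f"reduction factor: {full // lora}x")
```

Under these assumptions, each adapted projection trains 16,384 parameters instead of 1,048,576.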

In [None]:
!pip install transformers peft datasets accelerate sentencepiece

from transformers import MBartForConditionalGeneration, MBart50Tokenizer, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load mBART model and tokenizer
model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
# This checkpoint uses MBart50Tokenizer; source and target are both English for paraphrasing
tokenizer = MBart50Tokenizer.from_pretrained(model_name, src_lang="en_XX", tgt_lang="en_XX")

# Apply LoRA (Low-Rank Adaptation)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["k_proj", "v_proj", "q_proj", "out_proj"],
)
model = get_peft_model(model, lora_config)

#Load the dataset
dataset = load_dataset("glue", "mrpc")

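#Preprocessing
def preprocess_function(examples):
    # Tokenize sentence1 as the encoder input and sentence2 as the decoder target
    model_inputs = tokenizer(
        examples["sentence1"],
        text_target=examples["sentence2"],
        truncation=True,
        max_length=64,
        padding="max_length",
    )
    # Replace padding ids in the labels with -100 so the loss ignores them
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in label]
        for label in model_inputs["labels"]
    ]
    return model_inputs

# Drop the raw text and classification columns so only model inputs remain
tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)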

#  small subset for quick fine-tuning
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(500))

# training arguments
training_args = TrainingArguments(
    output_dir="./mbart_lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=5e-5,
    fp16=True,
    save_strategy="no",
    logging_steps=10,
)

# Training the model using Trainer API
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
)

# fine tuning
trainer.train()




The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'MBart50Tokenizer'. 
The class this function is called from is 'MBartTokenizer'.




| Step | Training Loss |
|------|---------------|
| 10 | 29.4414 |
| 20 | 28.5495 |
| 30 | 27.9294 |
| 40 | 25.6877 |
| 50 | 26.7978 |
| 60 | 26.4979 |
| 70 | 23.7039 |
| 80 | 25.9674 |
| 90 | 26.0466 |


TrainOutput(global_step=93, training_loss=26.7369139476489, metrics={'train_runtime': 5740.7546, 'train_samples_per_second': 0.261, 'train_steps_per_second': 0.016, 'total_flos': 199618713354240.0, 'train_loss': 26.7369139476489, 'epoch': 2.928})
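The reported `global_step=93` can be reproduced from the training arguments. The sketch below is a back-of-the-envelope check, assuming a single device and assuming the Trainer counts only full gradient-accumulation groups per epoch when computing the total (an assumption about the installed version, not something the log states).

```python
import math

# Values taken from the subset selection and TrainingArguments above.
num_examples = 500
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3

# Examples consumed per optimizer update (single device assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Mini-batches the dataloader yields in one epoch.
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)

# Optimizer updates per epoch, counting only full accumulation groups.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

total_steps = num_train_epochs * updates_per_epoch

print("effective batch size:", effective_batch_size)
print("batches per epoch:   ", batches_per_epoch)
print("updates per epoch:   ", updates_per_epoch)
print("total optimizer steps:", total_steps)
```

With these settings each optimizer update sees 16 examples, and the computed total matches the 93 steps in the training output.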

# **Generate Paraphrase & Translate into Multiple Languages:**

* **Function:** The generate_paraphrase_and_translate_multiple_languages function takes an input sentence, a source language code, and a list of target language codes, and generates one output per target language. It uses beam search with 5 beams and limits outputs to a maximum length of 64 tokens.
* **Tokenization:** The input sentence is tokenized and prepared for generation.
* **Source Language Setting:** The source language code is set on the tokenizer so that the input is tagged with the correct language.
* **Translation Loop:** For each target language, the function forces the target language code as the first generated token and decodes the highest-scoring beam.
* **Return Value:** The function returns a dictionary of paraphrased translations keyed by target language code.
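To illustrate what beam search does during generation, here is a minimal, self-contained sketch on a toy next-token table. The probabilities, `toy_next_probs`, and `beam_search` are all made up for illustration and are not how `model.generate` is implemented; the point is only that keeping several candidates can find a higher-probability sequence than the greedy choice.

```python
import math


def toy_next_probs(prefix):
    """Hypothetical next-token distribution for a given prefix (made-up numbers)."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.4, "cat": 0.3, "bird": 0.3},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    # Any other prefix ends the sentence.
    return table.get(prefix, {"<eos>": 1.0})


def beam_search(next_probs, num_beams=2, max_length=4):
    """Return the (tokens, log_prob) hypothesis with the highest log-probability."""
    beams = [((), 0.0)]  # unfinished hypotheses: (tokens, log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_probs(tokens).items():
                candidate = (tokens + (token,), score + math.log(prob))
                if token == "<eos>":
                    finished.append(candidate)
                else:
                    candidates.append(candidate)
        # Keep only the num_beams most likely unfinished hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_length
    return max(finished, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(toy_next_probs, num_beams=1)
beam_tokens, beam_score = beam_search(toy_next_probs, num_beams=2)
print("greedy:", greedy_tokens, round(math.exp(greedy_score), 2))
print("beam 2:", beam_tokens, round(math.exp(beam_score), 2))
```

In this toy table the greedy path commits to "a" and ends with probability 0.24, while two beams recover "the dog" with probability 0.36.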

In [None]:
# Save the model and tokenizer
model.save_pretrained("./fine_tuned_mbart")
tokenizer.save_pretrained("./fine_tuned_mbart")

#function to generate paraphrase and translate into multiple target languages
def generate_paraphrase_and_translate_multiple_languages(sentence, source_lang="en_XX", target_languages=("fr_XX", "de_DE", "es_XX")):
    """
    Generate paraphrases and translations in multiple target languages.
    :param sentence: Input sentence in the source language.
    :param source_lang: Language code of the source sentence.
    :param target_languages: Iterable of target language codes.
    :return: Dictionary mapping each target language code to its generated text.
    """
    # Set the source language before tokenizing so the correct language token is added
    tokenizer.src_lang = source_lang

    # Tokenize the input sentence and move every tensor to the model's device
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=64).to(model.device)

    # Switch to evaluation mode so dropout is disabled during generation
    model.eval()

    paraphrased_translations = {}

    for target_lang in target_languages:
        # Get the target language ID
        forced_bos_token_id = tokenizer.lang_code_to_id[target_lang]

        # Generate the output in the target language
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=forced_bos_token_id,
            num_beams=5,   # Beam search keeps the 5 most likely candidates at each step
            max_length=64, # Maximum output length
            early_stopping=True
        )

        # Decode and save the output
        paraphrased_translations[target_lang] = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return paraphrased_translations

# testing
sentence = "The quick brown fox jumps over the lazy dog."
source_language = "en_XX"
target_languages = ["fr_XX", "de_DE", "es_XX"]  # French, German, Spanish

# Generate paraphrases and translations
results = generate_paraphrase_and_translate_multiple_languages(sentence, source_lang=source_language, target_languages=target_languages)

# results
print("Input Sentence:", sentence)
for lang, translated_sentence in results.items():
    print(f"Translation in {lang}: {translated_sentence}")


Input Sentence: The quick brown fox jumps over the lazy dog.
Translation in fr_XX: rapide La fox brune va sur le chien lazi.
Translation in de_DE: Der schnelle braune Fuchs springt über den faulen Hund.
Translation in es_XX: La fox de color marrón rápido salta sobre el cane lejano.
