# **CS355: Introduction to Large Language Models (LLMs)**
---
## **Assignment 2:** Fine Tuning, Evaluation and Data Augmentation

### Student Name: <code>[Ali Muhammad Asad]</code>
### Student ID: <code>[aa07190]</code>

## **Assignment Objectives**
This assignment guides students through fine-tuning a model, evaluating it, and augmenting its training data. By the end of the assignment, students will:
*   Be able to fine-tune multiple models.

*   Evaluate models on multiple criteria.

*   Augment data using various augmentation strategies.



---

## **READ THESE INSTRUCTIONS FIRST**

* There are exactly **3** tasks in this notebook.

* Do not change or remove any pre-written code. The provided code is included intentionally. Make sure to pay special attention to import statements, variable names, and pre-written comments in the code cells.

* Carefully read the task description before beginning each task to ensure you understand what is required.

* **There is no penalty for using AI assistance on this homework** as long as you fully disclose it and understand the solution you have provided. If you do use AI please disclose its use in the cell below.

* Ensure that all code cells in your notebook are executed before submission, with the output clearly visible. If errors are encountered during evaluation, marks will only be awarded for tasks completed up to the error-producing cell. Any attempt to misrepresent the output, such as showing results not generated by the code, will be considered a violation of academic integrity, resulting in an automatic score of zero for the assignment.

* **Submit the completed and fully executed notebook file as your final submission**.






### Did you use any AI assistance to complete this assignment?
* *your response here*



# Background on fine-tuning LLMs

**Summary:**

1. **LLM Pretraining:**
   - Large Language Models (LLMs) are pretrained on extensive text corpora.
   - Llama 2 was pretrained on a dataset of 2 trillion tokens, compared to BERT's training on BookCorpus and Wikipedia.
   - Pretraining is resource-intensive and time-consuming.

2. **Auto-Regressive Prediction:**
   - Llama 2, an auto-regressive model, predicts the next token in a sequence.
   - A purely auto-regressive model is not inherently good at following instructions, which is why instruction tuning is needed.

3. **Fine-Tuning Techniques:**
   - Instruction tuning uses two main fine-tuning techniques:
     a. Supervised Fine-Tuning (SFT): Trained on instruction-response datasets, minimizing differences between generated and actual responses.
     b. Reinforcement Learning from Human Feedback (RLHF): Trained to maximize rewards based on human evaluations.

4. **RLHF vs. SFT:**
   - RLHF captures complex human preferences but requires careful reward system design and consistent human feedback.
   - Direct Preference Optimization (DPO) might be a future alternative to RLHF.
   - SFT can be highly effective when the model hasn't encountered specific data during pretraining.

# Fine-tuning BART for summarization: A detailed Example

This notebook contains an example of fine-tuning [Bart](https://huggingface.co/transformers/model_doc/bart.html) to generate summaries of article sections from the [WikiLingua](https://huggingface.co/datasets/wiki_lingua) dataset. WikiLingua is a multilingual collection of articles and their summaries. We start with the **English** portion of WikiLingua and the [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) Bart checkpoint from the [Hugging Face Model Hub](https://huggingface.co/models).

Please go through this example to understand the whole process of fine tuning a model.

## Setup

---

In [None]:
! pip install transformers --break-system-packages
! pip install datasets --break-system-packages
! pip install sentencepiece --break-system-packages
! pip install rouge_score --break-system-packages
! pip install wandb --break-system-packages
! pip install nltk --break-system-packages
! pip install tabulate --break-system-packages

In [1]:
import torch
import numpy as np
import datasets

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

from tabulate import tabulate
import nltk
from datetime import datetime

## Model and tokenizer
Download model and tokenizer. Use default parameters or try custom values (see [HF Bart configuration](https://huggingface.co/transformers/_modules/transformers/configuration_bart.html)).

In [2]:
language = "english"
model_name = "facebook/bart-large-cnn"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set model parameters or use the default
# print(model.config)

# tokenization
encoder_max_length = 256
decoder_max_length = 64

## Data

For demonstration, we are only using a small portion of the data.

In [3]:
data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]")

# Take a look at the data
for k, v in data["article"][0].items():
    print(k)
    print(v)

section_name
['Finding Other Transportation', 'Designating a Driver', 'Staying Safe']
document
['make sure that the area is a safe place, especially if you plan on walking home at night.  It’s always a good idea to practice the buddy system.  Have a friend meet up and walk with you. Research the bus, train, or streetcar routes available in your area to find safe and affordable travel to your destination.  Make sure you check the schedule for your outgoing and return travel.  Some public transportation will cease to run late at night.  Be sure if you take public transportation to the venue that you will also be able to get home late at night. Check the routes.  Even if some public transit is still running late at night, the routing may change.  Some may run express past many of the stops, or not travel all the way to the ends.  Be sure that your stop will still be available when you need it for your return trip. If you are taking public transit in a vulnerable state after drinking, it i

### Prepare

**Format and split into train and validation sets**

In [4]:
def flatten(example):
    return {
        "document": example["article"]["document"],
        "summary": example["article"]["summary"],
    }

def list2samples(example):
    documents = []
    summaries = []
    for sample in zip(example["document"], example["summary"]):
        if len(sample[0]) > 0:
            documents += sample[0]
            summaries += sample[1]
    return {"document": documents, "summary": summaries}

dataset = data.map(flatten, remove_columns=["article", "url"])
dataset = dataset.map(list2samples, batched=True)

train_data_txt, validation_data_txt = dataset.train_test_split(test_size=0.1).values()
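
The two mapping functions above turn one record per article into one row per article section. The following is a minimal sketch of that behaviour on a hand-made record; the toy article is invented for illustration, and the batch is built by hand to imitate what `map(..., batched=True)` passes to the function.

```python
# Toy stand-in for one WikiLingua record (contents are made up).
toy_record = {
    "url": "https://example.org/how-to",
    "article": {
        "section_name": ["Part 1", "Part 2"],
        "document": ["First section text.", "Second section text."],
        "summary": ["First summary.", "Second summary."],
    },
}

def flatten(example):
    # Pull the parallel lists out of the nested "article" dict.
    return {
        "document": example["article"]["document"],
        "summary": example["article"]["summary"],
    }

def list2samples(batch):
    # In batched mode each column is a list of per-record lists;
    # concatenating them yields one row per article section.
    documents, summaries = [], []
    for docs, sums in zip(batch["document"], batch["summary"]):
        if len(docs) > 0:
            documents += docs
            summaries += sums
    return {"document": documents, "summary": summaries}

flat = flatten(toy_record)
# Emulate a batch of two identical records.
batch = {"document": [flat["document"]] * 2, "summary": [flat["summary"]] * 2}
samples = list2samples(batch)
print(len(samples["document"]))  # number of section-level samples
```

This is consistent with the 2,000 loaded articles yielding 4,835 section-level samples (4,351 for training plus 484 for validation in the `Map` output of the tokenization step).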

**Preprocess and tokenize**

In [5]:
def batch_tokenize_preprocess(batch, tokenizer, max_source_length, max_target_length):
    source, target = batch["document"], batch["summary"]
    source_tokenized = tokenizer(
        source, padding="max_length", truncation=True, max_length=max_source_length
    )
    target_tokenized = tokenizer(
        target, padding="max_length", truncation=True, max_length=max_target_length
    )

    batch = {k: v for k, v in source_tokenized.items()}
    # Ignore padding in the loss
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in l]
        for l in target_tokenized["input_ids"]
    ]
    return batch

train_data = train_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=train_data_txt.column_names,
)

validation_data = validation_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=validation_data_txt.column_names,
)

Map:   0%|          | 0/4351 [00:00<?, ? examples/s]

Map:   0%|          | 0/484 [00:00<?, ? examples/s]
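
The `-100` replacement in `batch_tokenize_preprocess` is what keeps padded target positions out of the loss. A small sketch of just that step, in plain Python; the pad id and the token ids are illustrative values, not taken from the real tokenizer.

```python
PAD_TOKEN_ID = 1       # illustrative value; read it from tokenizer.pad_token_id in practice
IGNORE_INDEX = -100    # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX so padded positions add nothing to the loss."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Two made-up target sequences padded to length 6.
padded_targets = [
    [0, 713, 16, 2, 1, 1],
    [0, 250, 2, 1, 1, 1],
]
masked = mask_padding(padded_targets)
print(masked)
```

Only the labels are masked this way; the encoder inputs keep their padding ids and rely on the attention mask instead.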

In [6]:
!pip install evaluate --break-system-packages



## Training

---

### Metrics

In [7]:
import evaluate
import nltk

nltk.download("punkt_tab")
metric = evaluate.load("rouge")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results from ROUGE
    result = {key: value for key, value in result.items()}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/aa07190/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
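
To make the ROUGE numbers easier to interpret, here is a simplified sketch of ROUGE-1 as clipped unigram overlap. It assumes whitespace tokenization and no stemming, so its values will not match the `rouge` metric loaded above exactly; the function name `rouge1` is only for this illustration.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 on whitespace tokens (no stemming)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token is matched at most as often as it occurs in the other text.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 4), round(r, 4), round(f, 4))
```

ROUGE-2 follows the same pattern with bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.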


### Training arguments

In [8]:
training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    # learning_rate=3e-05,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="logs",
    logging_steps=50,
    save_total_limit=3,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=validation_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



### Train

Evaluate before fine-tuning

In [9]:
trainer.evaluate()

wandb: Currently logged in as: aa07190 (aa07190-habib-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


{'eval_loss': 4.019029140472412,
 'eval_model_preparation_time': 0.0046,
 'eval_rouge1': 0.2448,
 'eval_rouge2': 0.059,
 'eval_rougeL': 0.1665,
 'eval_rougeLsum': 0.2291,
 'eval_gen_len': 67.9587,
 'eval_runtime': 117.0842,
 'eval_samples_per_second': 4.134,
 'eval_steps_per_second': 1.033}

Train the model

In [10]:
trainer.train()

Step,Training Loss
50,3.8677
100,3.7483
150,3.6467
200,3.6375
250,3.6968
300,3.6567
350,3.641
400,3.6731
450,3.7387
500,3.7715




TrainOutput(global_step=1088, training_loss=3.668931722640991, metrics={'train_runtime': 253.9781, 'train_samples_per_second': 17.131, 'train_steps_per_second': 4.284, 'total_flos': 2357268030947328.0, 'train_loss': 3.668931722640991, 'epoch': 1.0})
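
The `global_step` value in the training output can be checked by hand from the dataset size and batch size. A small sketch, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept (the helper name is hypothetical):

```python
import math

def total_optimizer_steps(num_samples, batch_size, num_epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

# 4,351 training samples (from the Map progress output), batch size 4, 1 epoch.
print(total_optimizer_steps(4351, 4, 1))
```

With 4,351 samples and a batch size of 4, one epoch takes 1,088 steps, which matches `global_step=1088`.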

Evaluate after fine-tuning

In [11]:
trainer.evaluate()

{'eval_loss': 3.5344974994659424,
 'eval_model_preparation_time': 0.0046,
 'eval_rouge1': 0.3175,
 'eval_rouge2': 0.1179,
 'eval_rougeL': 0.2381,
 'eval_rougeLsum': 0.3063,
 'eval_gen_len': 64.0806,
 'eval_runtime': 106.0111,
 'eval_samples_per_second': 4.566,
 'eval_steps_per_second': 1.141,
 'epoch': 1.0}
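
The effect of one epoch of fine-tuning is easier to read as per-metric differences. The sketch below copies the ROUGE values from the two `trainer.evaluate()` outputs; the helper `absolute_gains` is introduced only for this illustration.

```python
# ROUGE values copied from the evaluation outputs before and after fine-tuning.
before = {"rouge1": 0.2448, "rouge2": 0.059, "rougeL": 0.1665, "rougeLsum": 0.2291}
after = {"rouge1": 0.3175, "rouge2": 0.1179, "rougeL": 0.2381, "rougeLsum": 0.3063}

def absolute_gains(before_scores, after_scores):
    """Per-metric difference (after - before), rounded to 4 decimals."""
    return {k: round(after_scores[k] - before_scores[k], 4) for k in before_scores}

gains = absolute_gains(before, after)
for name, gain in gains.items():
    print(f"{name}: {before[name]:.4f} -> {after[name]:.4f} ({gain:+.4f})")
```

All four ROUGE variants improve, and the evaluation loss drops from about 4.02 to about 3.53 over the same epoch.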

## Evaluation

---

**Generate summaries from the fine-tuned model and compare them with those generated from the original, pre-trained one.**

In [12]:
def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples["document"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str


model_before_tuning = AutoModelForSeq2SeqLM.from_pretrained(model_name)

test_samples = validation_data_txt.select(range(16))

summaries_before_tuning = generate_summary(test_samples, model_before_tuning)[1]
summaries_after_tuning = generate_summary(test_samples, model)[1]

In [13]:
print(
    tabulate(
        zip(
            range(len(summaries_after_tuning)),
            summaries_after_tuning,
            summaries_before_tuning,
        ),
        headers=["Id", "Summary after", "Summary before"],
    )
)
print("\nTarget summaries:\n")
print(
    tabulate(list(enumerate(test_samples["summary"])), headers=["Id", "Target summary"])
)
print("\nSource documents:\n")
print(tabulate(list(enumerate(test_samples["document"])), headers=["Id", "Document"]))

  Id  Summary after    Summary before
----  ---------------  ----------------

# Task 1 [40 Points]: Fine Tuning

You might need some luck for this task.
Look at the list given below:

* 0: Spanish
* 1: Portuguese
* 2: French
* 3: German
* 4: Russian
* 5: Italian
* 6: Indonesian
* 7: Dutch
* 8: Arabic
* 9: Vietnamese

Your student ID has 5 digits. Take its 2nd and 5th digits and pick the matching languages from this list. Load the WikiLingua datasets for both languages, choose 2 models from Hugging Face (one for the first language and one for the second), and fine-tune them.

The size of the data differs across languages. To make it fair for everyone, shuffle the data and then pick 10k samples per language for fine-tuning. Also, set the number of training epochs to 3.

Please note that you can't pick bart-large for this task, and the two models you pick must be different. It is also possible that the 2nd and 5th digits of your student ID are the same. In that case, use the 3rd or 4th digit of your student ID to pick the second language, making sure it differs from the first language.

After fine-tuning, save your models on your drive as `ModelName_LanguageName_BeforeAugmentation`, replacing `ModelName` with the name of the model and `LanguageName` with the name of the language.

What is your student ID and which languages did you pick?

Student ID: 07190
Languages: 2nd digit: 7, 5th digit: 0
{7: "Dutch", 0: "Spanish"}
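
The selection rule can be written down as a short function. This is a sketch only: the task text does not say whether the 3rd or the 4th digit takes priority when the 2nd and 5th digits coincide, so trying the 3rd digit first is an assumption, and `pick_languages` is a hypothetical helper name.

```python
LANGUAGES = ["Spanish", "Portuguese", "French", "German", "Russian",
             "Italian", "Indonesian", "Dutch", "Arabic", "Vietnamese"]

def pick_languages(student_id):
    """Return the two languages selected by the 2nd and 5th digits of a 5-digit ID."""
    digits = [int(ch) for ch in student_id]
    first, second = digits[1], digits[4]
    if first == second:
        # Assumed fallback order: try the 3rd digit, then the 4th.
        # Raises StopIteration if neither differs from the first pick.
        second = next(d for d in (digits[2], digits[3]) if d != first)
    return LANGUAGES[first], LANGUAGES[second]

print(pick_languages("07190"))
```

For the ID `07190` the 2nd digit is 7 and the 5th digit is 0, which selects Dutch and Spanish.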

In [1]:
# Import necessary libraries
import torch
import numpy as np
import datasets
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)
from tabulate import tabulate
import nltk
from datetime import datetime

In [2]:
# Load the datasets for Dutch and Spanish
dutch_data = datasets.load_dataset("wiki_lingua", name="dutch", split="train")
spanish_data = datasets.load_dataset("wiki_lingua", name="spanish", split="train")

# Randomly sample 10k samples from each dataset
dutch_data = dutch_data.shuffle(seed=42).select(range(10000))
spanish_data = spanish_data.shuffle(seed=42).select(range(10000))

In [3]:
# Load the models and tokenizers
dutch_model_name = "t5-small"
spanish_model_name = "t5-base"

dutch_model = AutoModelForSeq2SeqLM.from_pretrained(dutch_model_name)
dutch_tokenizer = AutoTokenizer.from_pretrained(dutch_model_name)

spanish_model = AutoModelForSeq2SeqLM.from_pretrained(spanish_model_name)
spanish_tokenizer = AutoTokenizer.from_pretrained(spanish_model_name)

# Set model parameters
encoder_max_length = 256
decoder_max_length = 64

In [4]:
# Prepare the data
def flatten(example):
    return {
        "document": example["article"]["document"],
        "summary": example["article"]["summary"],
    }

def list2samples(example):
    documents = []
    summaries = []
    for sample in zip(example["document"], example["summary"]):
        if len(sample[0]) > 0:
            documents += sample[0]
            summaries += sample[1]
    return {"document": documents, "summary": summaries}

dutch_dataset = dutch_data.map(flatten, remove_columns=["article", "url"])
dutch_dataset = dutch_dataset.map(list2samples, batched=True)

spanish_dataset = spanish_data.map(flatten, remove_columns=["article", "url"])
spanish_dataset = spanish_dataset.map(list2samples, batched=True)

In [5]:
# Split into train and validation sets
dutch_train_data_txt, dutch_validation_data_txt = dutch_dataset.train_test_split(test_size=0.1).values()
spanish_train_data_txt, spanish_validation_data_txt = spanish_dataset.train_test_split(test_size=0.1).values()

In [6]:
# Tokenize the data
def batch_tokenize_preprocess(batch, tokenizer, max_source_length, max_target_length):
    source, target = batch["document"], batch["summary"]
    source_tokenized = tokenizer(
        source, padding="max_length", truncation=True, max_length=max_source_length
    )
    target_tokenized = tokenizer(
        target, padding="max_length", truncation=True, max_length=max_target_length
    )

    batch = {k: v for k, v in source_tokenized.items()}
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in l]
        for l in target_tokenized["input_ids"]
    ]
    return batch

dutch_train_data = dutch_train_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, dutch_tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=dutch_train_data_txt.column_names,
)

dutch_validation_data = dutch_validation_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, dutch_tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=dutch_validation_data_txt.column_names,
)

spanish_train_data = spanish_train_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, spanish_tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=spanish_train_data_txt.column_names,
)

spanish_validation_data = spanish_validation_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, spanish_tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=spanish_validation_data_txt.column_names,
)

Map:   0%|          | 0/25877 [00:00<?, ? examples/s]

Map:   0%|          | 0/2876 [00:00<?, ? examples/s]

Map:   0%|          | 0/26364 [00:00<?, ? examples/s]

Map:   0%|          | 0/2930 [00:00<?, ? examples/s]

In [7]:
# Define the training arguments
dutch_training_args = Seq2SeqTrainingArguments(
    output_dir="dutch_results",
    num_train_epochs=3,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="dutch_logs",
    logging_steps=50,
    save_total_limit=3,
)

spanish_training_args = Seq2SeqTrainingArguments(
    output_dir="spanish_results",
    num_train_epochs=3,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="spanish_logs",
    logging_steps=50,
    save_total_limit=3,
)

# Define the data collator
dutch_data_collator = DataCollatorForSeq2Seq(dutch_tokenizer, model=dutch_model)
spanish_data_collator = DataCollatorForSeq2Seq(spanish_tokenizer, model=spanish_model)

# Define the trainer
dutch_trainer = Seq2SeqTrainer(
    model=dutch_model,
    args=dutch_training_args,
    data_collator=dutch_data_collator,
    train_dataset=dutch_train_data,
    eval_dataset=dutch_validation_data,
    tokenizer=dutch_tokenizer,
)

spanish_trainer = Seq2SeqTrainer(
    model=spanish_model,
    args=spanish_training_args,
    data_collator=spanish_data_collator,
    train_dataset=spanish_train_data,
    eval_dataset=spanish_validation_data,
    tokenizer=spanish_tokenizer,
)



OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 27.25 MiB is free. Process 248998 has 694.00 MiB memory in use. Process 992207 has 22.55 GiB memory in use. Including non-PyTorch memory, this process has 372.00 MiB memory in use. Of the allocated memory 206.80 MiB is allocated by PyTorch, and 3.20 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [21]:
# Train the models
dutch_trainer.train()
spanish_trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
50,5.2361
100,5.0394
150,4.9431
200,4.8745
250,4.8017
300,4.7365
350,4.6709
400,4.592
450,4.5834
500,4.6318


Step,Training Loss
50,4.6484
100,4.3691
150,4.1664
200,4.0874
250,3.9673
300,3.9338
350,3.884
400,3.8332
450,3.8152
500,3.7901


TrainOutput(global_step=19773, training_loss=3.226625258358393, metrics={'train_runtime': 2854.1859, 'train_samples_per_second': 27.711, 'train_steps_per_second': 6.928, 'total_flos': 2.408184874008576e+16, 'train_loss': 3.226625258358393, 'epoch': 3.0})

In [9]:
# Save the models
dutch_model.save_pretrained("t5-small_dutch_BeforeAugmentation")
dutch_tokenizer.save_pretrained("t5-small_dutch_BeforeAugmentation")

spanish_model.save_pretrained("t5-base_spanish_BeforeAugmentation")
spanish_tokenizer.save_pretrained("t5-base_spanish_BeforeAugmentation")

('t5-base_spanish_BeforeAugmentation/tokenizer_config.json',
 't5-base_spanish_BeforeAugmentation/special_tokens_map.json',
 't5-base_spanish_BeforeAugmentation/spiece.model',
 't5-base_spanish_BeforeAugmentation/added_tokens.json',
 't5-base_spanish_BeforeAugmentation/tokenizer.json')

#### ------- Loading the Models Only - not part of tasks - for my ease --------

In [None]:
dutch_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small_dutch_BeforeAugmentation")
dutch_tokenizer = AutoTokenizer.from_pretrained("t5-small_dutch_BeforeAugmentation")
print("Dutch model loaded successfully")

spanish_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base_spanish_BeforeAugmentation")
spanish_tokenizer = AutoTokenizer.from_pretrained("t5-base_spanish_BeforeAugmentation")
print("Spanish model loaded successfully")

Dutch model loaded successfully
Spanish model loaded successfully


# Task 2 [20 Points]: Evaluation
You already saw the ROUGE score used for evaluation earlier. As we saw in class, it is not the only evaluation metric. Implement 2 more evaluation metrics: BLEU and BERTScore. Evaluate both models with all three metrics, i.e. 3 evaluation criteria for each model.
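
As a reminder of what BLEU measures, here is a simplified sentence-level sketch: clipped n-gram precision combined with a brevity penalty. It assumes whitespace tokens, a single reference, n-grams up to 2 and no smoothing, so it is not a replacement for the `bleu` metric in `evaluate`, which is computed at corpus level with longer n-grams by default.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count all contiguous n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_n=2):
    """Sentence-level BLEU on whitespace tokens, single reference, no smoothing."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        clipped = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one unmatched n-gram order zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes predictions shorter than the reference.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(simple_bleu("the cat sat on the mat", "the cat sat on the mat"), 4))
print(round(simple_bleu("the cat sat", "the cat sat on the mat"), 4))
```

The second call shows the brevity penalty at work: every n-gram of the short prediction is correct, yet the score is well below 1 because the prediction covers only half of the reference length.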


In [None]:
# Write your code for Task 2 here. You can utilize the functions given earlier for this as well.

In [None]:
! pip install bert_score --break-system-packages

In [10]:
# Import necessary libraries
import evaluate
from bert_score import score as bert_score

In [11]:
# Load the metrics
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

# Define the evaluation function
def compute_metrics(eval_preds, tokenizer, lang="en"):
    # `lang` is forwarded to bert_score, which picks its default scoring model per language.
    # The default "en" matches the runs recorded below; pass "nl" or "es" to score
    # Dutch or Spanish text with the default model for that language.
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute BLEU score
    bleu_result = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)

    # Compute ROUGE score
    rouge_result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Compute BERT score
    P, R, F1 = bert_score(decoded_preds, decoded_labels, lang=lang)

    return {
        "bleu": bleu_result["bleu"],
        "rouge": rouge_result,
        "bert_score": F1.mean().item()
    }
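
BERTScore compares texts through token embeddings rather than exact word matches. The sketch below shows only the greedy-matching idea behind its precision, recall and F1, using made-up 2-D vectors in place of contextual embeddings; the real `bert_score` package also offers options such as idf weighting and baseline rescaling that are left out here.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_f1(pred_vecs, ref_vecs):
    """BERTScore-style precision/recall/F1 from per-token embedding lists."""
    # Precision: each predicted token is matched to its most similar reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token is matched to its most similar predicted token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up 2-D "embeddings"; a real run would take them from a pretrained encoder.
pred_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p, r, f1 = greedy_match_f1(pred_vecs, ref_vecs)
print(round(p, 4), round(r, 4), round(f1, 4))
```

Because similarity is soft, scores tend to stay high even when surface overlap is low, which is one reason BERTScore values sit in a much higher range than BLEU or ROUGE values for the same outputs.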

In [14]:
# Evaluate the Dutch model
# dutch_eval_results = dutch_trainer.evaluate()
# dutch_metrics = compute_metrics((dutch_eval_results.predictions, dutch_eval_results.label_ids), dutch_tokenizer)
dutch_preds = dutch_trainer.predict(dutch_validation_data)
dutch_metrics = compute_metrics((dutch_preds.predictions, dutch_preds.label_ids), dutch_tokenizer)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# Print the results
print("Dutch Model Metrics:")
print(dutch_metrics)

Dutch Model Metrics:
{'bleu': 0.001908603138159173, 'rouge': {'rouge1': 0.1032764394401323, 'rouge2': 0.016565502157507038, 'rougeL': 0.09033051054090999, 'rougeLsum': 0.09031703621076392}, 'bert_score': 0.7769189476966858}


In [12]:
# Evaluate the Spanish model
# spanish_eval_results = spanish_trainer.evaluate()
# spanish_metrics = compute_metrics((spanish_eval_results.predictions, spanish_eval_results.label_ids), spanish_tokenizer)
spanish_preds = spanish_trainer.predict(spanish_validation_data)
spanish_metrics = compute_metrics((spanish_preds.predictions, spanish_preds.label_ids), spanish_tokenizer)

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


wandb: Currently logged in as: aa07190 (aa07190-habib-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
print("Spanish Model Metrics:")
print(spanish_metrics)

Spanish Model Metrics:
{'bleu': 0.004221800724303033, 'rouge': {'rouge1': 0.11402956450832896, 'rouge2': 0.023624609183256588, 'rougeL': 0.09960721689678831, 'rougeLsum': 0.09968149530677677}, 'bert_score': 0.8077273368835449}


# Task 3 [40 points]: Data Augmentation
There are various ways to augment data, depending on the type of data. Since LLMs deal mainly with text, let's talk about text data augmentation.

The most prominent methods for augmenting text data are:

* Word or sentence shuffling: randomly change the position of a word or sentence.
* Synonym replacement: replace words with synonyms.
* Syntax-tree manipulation: paraphrase the sentence using the same words, e.g. "the cat sat on the mat" becomes "on the mat, the cat sat".
* Random word insertion: insert words at random.
* Random word deletion: delete words at random.
* Antonym replacement: randomly choose n words from the sentence that are not stop words, and replace each of them with one of its antonyms chosen at random.
* Backtranslation: translate the data into another language and then back into the original language.

There are several ways to implement these methods. You can write your own functions and use a dictionary library (for synonym/antonym replacement and word insertion). For backtranslation, you can use a translation library. But that is too time-consuming to implement. Another approach is to feed your data to a large model for augmentation, but that is costly and requires a really good prompt so that the model doesn't hallucinate. Moreover, all of these tasks have been implemented before. So why not use existing, standard code?
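
To make two of the listed methods concrete, here is a hand-written sketch of random word deletion and word shuffling (as pairwise swaps) using only the standard library. The function names, the probability `p` and the `seed` argument are choices made for this illustration and are not part of any library.

```python
import random

def random_delete(text, p=0.2, seed=None):
    """Drop each word with probability p, always keeping at least one word."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept if kept else [rng.choice(words)])

def random_swap(text, n_swaps=1, seed=None):
    """Swap two randomly chosen word positions, n_swaps times."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

sentence = "the cat sat on the mat"
print(random_delete(sentence, p=0.3, seed=0))
print(random_swap(sentence, n_swaps=1, seed=0))
```

Both functions work on whitespace tokens only, so punctuation stays attached to the neighbouring word; a library implementation would typically tokenize more carefully.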

So let me introduce you to [nlpaug](https://nlpaug.readthedocs.io/en/latest/). This library is used for data augmentation for not just text but for audio and images as well. You will use it to augment your data. Seven methods are listed above. Use backtranslation to augment data for both languages. Of the remaining 6, use any 3 methods for the first language and the other 3 methods for the second language. For each method, generate 10 new samples.

After completing the augmentation, train your models again for 3 epochs and evaluate them against all 3 evaluation metrics.

After fine-tuning, save your models on your drive as `ModelName_LanguageName_AfterAugmentation`, replacing `ModelName` with the name of the model and `LanguageName` with the name of the language.

In [None]:
# Write your code for Task 3 here. You can utilize the functions given earlier for this as well.

In [None]:
! pip install nlpaug --break-system-packages
! pip install sacremoses --break-system-packages

In [23]:
# Sample code to follow:

import datasets
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import json
import random
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from datasets import Dataset, concatenate_datasets

# Download required NLTK resources
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/aa07190/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/aa07190/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/aa07190/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
print(nltk.data.path)

['/home/aa07190/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


In [17]:
# Load backtranslation models
backtrans_aug_dutch = naw.BackTranslationAug(
    from_model_name='Helsinki-NLP/opus-mt-nl-en',  # Dutch → English
    to_model_name='Helsinki-NLP/opus-mt-en-nl'    # English → Dutch
)
backtrans_aug_spanish = naw.BackTranslationAug(
    from_model_name='Helsinki-NLP/opus-mt-es-en',  # Spanish → English
    to_model_name='Helsinki-NLP/opus-mt-en-es'    # English → Spanish
)

In [18]:
# Define augmentation methods
synonym_aug = naw.SynonymAug(aug_src='wordnet', lang='nld')  # For Dutch
synonym_aug_spanish = naw.SynonymAug(aug_src='wordnet', lang='spa')  # For Spanish
insert_aug = nac.RandomCharAug(action="insert")
antonym_aug = naw.AntonymAug()

shuffle_aug = naw.RandomWordAug(action="swap")
delete_aug = naw.RandomWordAug(action="delete")
syntax_aug = nas.RandomSentAug(action="swap")

In [19]:
# Convert dataset to a list of dictionaries
data_dutch_list = list(dutch_data)
data_spanish_list = list(spanish_data)

In [20]:
# Split into four distinct groups (10 samples each)
random_samples_synonym_dutch = data_dutch_list[:10]
random_samples_shuffle_dutch = data_dutch_list[10:20]
random_samples_syntax_dutch = data_dutch_list[20:30]
random_samples_backtrans_dutch = data_dutch_list[30:40]

random_samples_synonym_spanish = data_spanish_list[:10]
random_samples_shuffle_spanish = data_spanish_list[10:20]
random_samples_syntax_spanish = data_spanish_list[20:30]
random_samples_backtrans_spanish = data_spanish_list[30:40]
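
Note that the slices above take the first 40 examples in order, so the groups are disjoint but not actually random. If a random selection is wanted, one option is to draw distinct indices once and cut them into consecutive chunks. This is only a sketch: `disjoint_groups` is a hypothetical helper name, and the toy list stands in for `data_dutch_list`.

```python
import random


def disjoint_groups(items, group_names, group_size=10, seed=42):
    """Return {name: list of group_size items}, with no item shared between groups."""
    needed = group_size * len(group_names)
    if len(items) < needed:
        raise ValueError(f"need at least {needed} items, got {len(items)}")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    picked = rng.sample(range(len(items)), needed)  # distinct indices
    return {
        name: [items[i] for i in picked[k * group_size:(k + 1) * group_size]]
        for k, name in enumerate(group_names)
    }


toy_data = [{"id": i} for i in range(100)]  # stand-in for data_dutch_list
groups = disjoint_groups(toy_data, ["synonym", "shuffle", "syntax", "backtrans"])
print({name: len(samples) for name, samples in groups.items()})
```

Because the indices are sampled without replacement, no example is augmented by more than one method.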

In [21]:
# Define augmentation function
import copy

def augment_sample(sample, augmenter):
    # Deep copy so the nested "article" dict of the original sample is not modified
    augmented_sample = copy.deepcopy(sample)

    # Augment every paragraph and summary, flattening the results
    # (augmenter.augment() is assumed to return a list of strings)
    augmented_sample["article"]["document"] = [
        para for paragraph in sample["article"]["document"] for para in augmenter.augment(paragraph)
    ]

    augmented_sample["article"]["summary"] = [
        summ for summary in sample["article"]["summary"] for summ in augmenter.augment(summary)
    ]

    return augmented_sample
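
One detail worth keeping in mind when copying nested samples like these: `dict.copy()` is shallow, so the inner `article` dictionary is shared between the copy and the original, whereas `copy.deepcopy` duplicates it. The toy sample below is made up for illustration and only mimics the nested `article` layout.

```python
import copy

# Toy sample mimicking the nested layout (made-up content)
sample = {"url": "x", "article": {"document": ["eerste alinea"], "summary": ["kort"]}}

shallow = sample.copy()
shallow["article"]["document"] = ["CHANGED"]
print(sample["article"]["document"])  # the original was modified too

# Start again from a fresh sample and copy it deeply this time
sample = {"url": "x", "article": {"document": ["eerste alinea"], "summary": ["kort"]}}
deep = copy.deepcopy(sample)
deep["article"]["document"] = ["CHANGED"]
print(sample["article"]["document"])  # the original is untouched
```

With a shallow copy, the source examples would silently end up holding augmented text as well.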

In [22]:
# Apply different augmentations to distinct samples (Dutch)
aug_synonym_dutch = [augment_sample(sample, synonym_aug) for sample in random_samples_synonym_dutch]
aug_shuffle_dutch = [augment_sample(sample, shuffle_aug) for sample in random_samples_shuffle_dutch]
aug_syntax_dutch = [augment_sample(sample, syntax_aug) for sample in random_samples_syntax_dutch]
aug_backtrans_dutch = [augment_sample(sample, backtrans_aug_dutch) for sample in random_samples_backtrans_dutch]

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/aa07190/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger_eng not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger_eng')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger_eng/

  Searched in:
    - '/home/aa07190/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [None]:
# Apply different augmentations to distinct samples (Spanish)
aug_synonym_spanish = [augment_sample(sample, synonym_aug_spanish) for sample in random_samples_synonym_spanish]
aug_shuffle_spanish = [augment_sample(sample, shuffle_aug) for sample in random_samples_shuffle_spanish]
aug_syntax_spanish = [augment_sample(sample, syntax_aug) for sample in random_samples_syntax_spanish]
aug_backtrans_spanish = [augment_sample(sample, backtrans_aug_spanish) for sample in random_samples_backtrans_spanish]

In [None]:
# Combine all augmented data (Dutch)
aug_data_dutch = aug_synonym_dutch + aug_shuffle_dutch + aug_syntax_dutch + aug_backtrans_dutch

# Combine all augmented data (Spanish)
aug_data_spanish = aug_synonym_spanish + aug_shuffle_spanish + aug_syntax_spanish + aug_backtrans_spanish

# Convert augmented data lists to Datasets (ensuring proper structure)
aug_data_dutch = Dataset.from_list(aug_data_dutch)
aug_data_spanish = Dataset.from_list(aug_data_spanish)

In [None]:
# Define flatten and list2samples functions (from your original code)
def flatten(example):
    return {
        "document": example["article"]["document"],
        "summary": example["article"]["summary"],
    }

def list2samples(example):
    documents = []
    summaries = []
    for sample in zip(example["document"], example["summary"]):
        if len(sample[0]) > 0:
            documents += sample[0]
            summaries += sample[1]
    return {"document": documents, "summary": summaries}

In [None]:
# Apply transformations to ensure consistent format (Dutch)
data_dutch = dutch_data.map(flatten, remove_columns=["article", "url"])
data_dutch = data_dutch.map(list2samples, batched=True)

aug_data_dutch = aug_data_dutch.map(flatten, remove_columns=["article", "url"])
aug_data_dutch = aug_data_dutch.map(list2samples, batched=True)

# Apply transformations to ensure consistent format (Spanish)
data_spanish = spanish_data.map(flatten, remove_columns=["article", "url"])
data_spanish = data_spanish.map(list2samples, batched=True)

aug_data_spanish = aug_data_spanish.map(flatten, remove_columns=["article", "url"])
aug_data_spanish = aug_data_spanish.map(list2samples, batched=True)

In [None]:
# Concatenate datasets after format alignment (Dutch)
combined_data_dutch = concatenate_datasets([data_dutch, aug_data_dutch])

# Concatenate datasets after format alignment (Spanish)
combined_data_spanish = concatenate_datasets([data_spanish, aug_data_spanish])

In [None]:
# Tokenize the augmented data
dutch_train_data = combined_data_dutch.map(
    lambda batch: batch_tokenize_preprocess(
        batch, dutch_tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=combined_data_dutch.column_names,  # drop the raw text columns of the dataset being mapped
)

spanish_train_data = combined_data_spanish.map(
    lambda batch: batch_tokenize_preprocess(
        batch, spanish_tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=combined_data_spanish.column_names,  # drop the raw text columns of the dataset being mapped
)

In [None]:
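# Train the models on the augmented training sets
# (the trainers are assumed to have been created earlier, so point them at the new data first)
dutch_trainer.train_dataset = dutch_train_data
spanish_trainer.train_dataset = spanish_train_data
dutch_trainer.train()
spanish_trainer.train()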

In [None]:
# Save the models
dutch_model.save_pretrained("t5-small_dutch_AfterAugmentation")
dutch_tokenizer.save_pretrained("t5-small_dutch_AfterAugmentation")
spanish_model.save_pretrained("t5-base_spanish_AfterAugmentation")
spanish_tokenizer.save_pretrained("t5-base_spanish_AfterAugmentation")

In [None]:
# Evaluate the models after augmentation
dutch_preds = dutch_trainer.predict(dutch_validation_data)
dutch_metrics = compute_metrics((dutch_preds.predictions, dutch_preds.label_ids), dutch_tokenizer)

In [None]:
print("Dutch Model Metrics After Augmentation:")
print(dutch_metrics)

In [None]:
spanish_preds = spanish_trainer.predict(spanish_validation_data)
spanish_metrics = compute_metrics((spanish_preds.predictions, spanish_preds.label_ids), spanish_tokenizer)

In [None]:
print("Spanish Model Metrics After Augmentation:")
print(spanish_metrics)