## Fine-tuning using Cascaded Training

Due to limited computational resources, the fine-tuning of the mBART model was conducted using a **cascaded training** approach — splitting the process into sequential, resource-efficient stages. This allowed us to gradually adapt the model while managing GPU memory and training time.

The fine-tuning process was divided into the following five phases:

### Table of Contents
- **Phase 1** – Decoder-only training
- **Phase 2** – Encoder-only training
- **Phase 3** – Joint encoder-decoder fine-tuning
- **Phase 4** – Low-Rank Adaptation (LoRA) with quantization
- **Phase 5** – Final LoRA refinement with longer input/output
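
For quick reference, the per-phase settings used in the training cells of this notebook can be collected in one place. The sketch below only restates values that appear in those cells; the `PhaseConfig` name and its fields are hypothetical and not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Hypothetical container; the values are copied from the training cells."""
    name: str
    max_input_len: int
    max_output_len: int
    learning_rate: float
    epochs: int


PHASES = [
    PhaseConfig("Phase 1: decoder", 640, 90, 5e-4, 2),
    PhaseConfig("Phase 2: encoder", 896, 90, 3e-4, 2),
    PhaseConfig("Phase 3: joint", 896, 90, 1e-4, 3),
    PhaseConfig("Phase 4: LoRA", 896, 96, 2e-4, 3),
    PhaseConfig("Phase 5: LoRA, longer sequences", 1024, 128, 2e-4, 3),
]

# Print one line per phase to compare sequence lengths and learning rates.
for p in PHASES:
    print(f"{p.name:<34} in={p.max_input_len:<5} out={p.max_output_len:<4} "
          f"lr={p.learning_rate:g} epochs={p.epochs}")
```

Every phase uses the same batch settings (batch size 1 with 16 gradient accumulation steps), so only the sequence lengths, learning rate and epoch count vary.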


### Phase 1 – Decoder-only training

In the first phase, the mBART decoder was trained while the encoder was frozen (its weights remained unchanged). The model learned to adjust its text-generation mechanism without affecting its ability to comprehend the input. Training the decoder separately lets the model focus on the linguistic fluency and structure of summaries without compromising the encoder's ability to extract meaningful features from the text.



### Installing packages
* -U - upgrades the listed packages to their latest versions
* transformers - models and libraries
* huggingface_hub - connects to the Hugging Face Hub for downloading and uploading models
* datasets - data loading and preparation
* evaluate - metrics for evaluating results
* accelerate - optimizes training performance
* --force-reinstall --no-cache-dir transformers huggingface_hub - reinstalls the packages while bypassing cached copies (needed in the Colab environment to prevent version conflicts between huggingface_hub and transformers)
* rouge_score - installs the ROUGE metric implementation
* bitsandbytes - quantization and 8-bit optimizers for low-VRAM LoRA training
* peft - framework for parameter-efficient fine-tuning methods (installed together with transformers and accelerate because peft requires compatible versions)

In [None]:
!pip install -U transformers huggingface_hub datasets evaluate accelerate
!pip install --force-reinstall --no-cache-dir transformers huggingface_hub
!pip install rouge_score
!pip install -U bitsandbytes peft transformers accelerate
!pip install evaluate rouge_score nltk bert_score

### Import modules and functions

**torch**
* torch.cuda.empty_cache() - releases unused cached GPU memory held by PyTorch (essential in limited-resource environments such as Colab)
* os.environ["PYTORCH_CUDA_ALLOC_CONF"] - configures the PyTorch CUDA memory allocator to manage GPU memory more efficiently during training
* expandable_segments:True - lets the allocator grow existing memory segments, which reduces fragmentation
* garbage_collection_threshold:0.8 - the allocator starts reclaiming unused cached blocks once GPU memory usage exceeds 80% of the available capacity

**transformers**
* MBartTokenizer - converts text to mBART tokens and back
* MBartForConditionalGeneration - model architecture for text generation (sequence-to-sequence)
* Seq2SeqTrainingArguments - training configuration (batch size, number of epochs, optimizer)
* Seq2SeqTrainer - sequence-to-sequence trainer with additional text generation and evaluation logic
* DataCollatorForSeq2Seq - pads the inputs and labels in each batch to a common length

**peft (Parameter-Efficient Fine-Tuning)**
* LoraConfig - sets the adapter rank, the task type and the modules that LoRA is applied to
* get_peft_model - wraps the model with LoRA adapters according to the configuration (saves GPU resources)
* TaskType - indicates the type of task (used by LoraConfig for correct LoRA integration)

In [None]:
import os
# Set the allocator options before PyTorch initializes CUDA so that they take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,garbage_collection_threshold:0.8"

import torch
torch.cuda.empty_cache()

import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import (
    MBartTokenizer,
    MBartForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
import evaluate
from huggingface_hub import login
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, TaskType

### Training data
Loading the training, validation and test data

* model_checkpoint - model name on the Hugging Face Hub
* src_lang="lt_LT" - the input is in Lithuanian
* tgt_lang="lt_LT" - the output is in Lithuanian

In [None]:
train_df = pd.read_csv("/content/train.csv")
val_df = pd.read_csv("/content/validation.csv")
test_df = pd.read_csv("/content/test.csv")

dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
    "test": Dataset.from_pandas(test_df),
})

model_checkpoint = "facebook/mbart-large-50"
tokenizer = MBartTokenizer.from_pretrained(model_checkpoint, src_lang="lt_LT", tgt_lang="lt_LT")

### Decoder training
#### Initialize and configure the model
* from_pretrained - loads facebook/mbart-large-50
* load_in_4bit - loads the model weights in 4-bit precision (reduces memory consumption)
* device_map="auto" - automatically places the model on the available devices (GPU/CPU), which is useful in memory-constrained environments like Colab
* use_cache=False - disables the decoder key/value cache, which is only useful during generation and wastes memory during training
* torch_dtype=torch.bfloat16 - bfloat16 ("brain float") is a 16-bit format that lowers memory usage with little loss of accuracy
* model.enable_input_require_grads() - makes the embedding outputs require gradients so that gradients can reach the adapters while the base weights are frozen
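
To get a feel for why 4-bit loading matters, a back-of-the-envelope estimate of weight memory is sketched below. The parameter count (about 611 million for `facebook/mbart-large-50`) is an approximate figure assumed here, and the estimate ignores quantization constants, activations, gradients and optimizer state, so real usage is higher.

```python
# Rough weight-memory estimate; PARAM_COUNT is an assumed approximate figure.
PARAM_COUNT = 611_000_000


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


for label, bits in [("fp32", 32), ("bf16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(PARAM_COUNT, bits):.2f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the memory of bf16 weights and an eighth of fp32 weights.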

#### Decoder training (freezing encoder)
* for name, param in model.model.encoder.named_parameters(): param.requires_grad = False - freezes the encoder

#### LoRA configuration
* r=128 - rank of the LoRA update matrices; LoRA trains only small additional low-rank layers instead of the entire model (the larger the r, the more parameters LoRA can learn, but the more memory is required)
* lora_alpha=256 - scaling factor that controls how strongly the LoRA layers influence the model's output (the update is scaled by lora_alpha / r)
* target_modules=["q_proj", "v_proj", "k_proj"] - only these attention projections (query, value and key) are wrapped with LoRA adapters
* get_peft_model() - wraps the model with LoRA adapters and prepares it to train only the selected components

**Overall, this enables efficient training even on weaker GPUs**
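
The effect of `r` on the number of trainable parameters can be estimated by hand. The sketch below assumes the usual mBART-large layout (hidden size 1024, 12 encoder layers with self-attention, 12 decoder layers with self- and cross-attention); these architecture figures are assumptions and not read from the model here, and `lora_params` is a hypothetical helper.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Values from the LoRA configuration in this notebook.
R, ALPHA = 128, 256

# Assumed mBART-large architecture (not read from the checkpoint).
D_MODEL = 1024
ATTENTION_BLOCKS = 12 + 12 * 2  # encoder self-attn + decoder self-/cross-attn
TARGETS_PER_BLOCK = 3           # q_proj, k_proj, v_proj

per_projection = lora_params(D_MODEL, D_MODEL, R)
total = per_projection * ATTENTION_BLOCKS * TARGETS_PER_BLOCK
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"per projection:  {per_projection:,}")
print(f"total trainable: {total:,}")
print(f"scaling alpha/r: {scaling}")
```

With these assumptions the adapters add roughly 28 million trainable parameters, a small fraction of the full model, and `model.print_trainable_parameters()` can be used to check the actual number.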

#### Tokenization
* truncation - text longer than max_length is truncated
* padding - shorter texts are filled with special PAD tokens so that every example has the same length (input = 640, output = 90 tokens)
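
Truncation and padding to a fixed length can be illustrated without the tokenizer. This is a simplified sketch: `pad_or_truncate` and the pad ID are hypothetical, and the real tokenizer additionally keeps special tokens (such as the end-of-sequence marker) when it truncates.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int) -> List[int]:
    """Cut sequences longer than max_length; right-pad shorter ones with pad_id."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


PAD_ID = 1  # hypothetical pad token ID, for illustration only

short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=PAD_ID)
long = pad_or_truncate(list(range(10, 20)), max_length=5, pad_id=PAD_ID)
print(short)  # [5, 6, 7, 1, 1]
print(long)   # [10, 11, 12, 13, 14]
```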

#### Training parameters
* per_device_train_batch_size=1 - one example is processed at a time
* gradient_accumulation_steps=16 - gradients are accumulated over 16 examples before the weights are updated
**This gives an effective batch size of 16 (per_device_train_batch_size * gradient_accumulation_steps) while saving GPU RAM**
* adamw_bnb_8bit - AdamW optimizer from bitsandbytes that stores its optimizer states in 8-bit format while the update itself is computed in 32 bits (saves roughly 70% of the optimizer memory)
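
The relationship between the batch settings and the number of weight updates can be worked out with simple arithmetic. In this sketch the dataset size is a made-up figure and both helper functions are hypothetical; how a trainer handles a final incomplete accumulation group may differ from the rounding used here.

```python
import math


def effective_batch_size(per_device_batch: int, accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to one weight update."""
    return per_device_batch * accumulation_steps * num_devices


def optimizer_steps_per_epoch(num_examples: int, per_device_batch: int,
                              accumulation_steps: int) -> int:
    """Approximate number of weight updates in one pass over the data."""
    micro_batches = math.ceil(num_examples / per_device_batch)
    return math.ceil(micro_batches / accumulation_steps)


NUM_EXAMPLES = 10_000  # hypothetical training-set size

batch = effective_batch_size(per_device_batch=1, accumulation_steps=16)
steps = optimizer_steps_per_epoch(NUM_EXAMPLES, per_device_batch=1,
                                  accumulation_steps=16)
print(f"effective batch size: {batch}")
print(f"weight updates per epoch: {steps}")
```

Only one example has to fit in GPU memory at a time, yet each update is based on 16 examples.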

In [None]:
model = MBartForConditionalGeneration.from_pretrained(
    model_checkpoint,
    load_in_4bit=True,
    device_map="auto",
    use_cache=False,
    torch_dtype=torch.bfloat16
)
model.enable_input_require_grads()

for name, param in model.model.encoder.named_parameters():
    param.requires_grad = False

lora_config = LoraConfig(
    task_type="SEQ_2_SEQ_LM",
    inference_mode=False,
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "v_proj", "k_proj"],
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)

# get_peft_model also adds trainable LoRA adapters to the encoder; freeze them so that only the decoder is trained
for name, param in model.named_parameters():
    if ".encoder." in name:
        param.requires_grad = False

def preprocess_phase1(example):
    inputs = tokenizer(example["text"], max_length=640, truncation=True, padding="max_length")
    labels = tokenizer(example["summary"], max_length=90, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized_datasets = dataset.map(preprocess_phase1, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="phase1_decoder",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=0.0005,
    num_train_epochs=2,
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=100,
    save_steps=1000,
    bf16=True,
    optim="adamw_bnb_8bit",
    push_to_hub=False
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()

model.save_pretrained("/content/phase1_decoder")
tokenizer.save_pretrained("/content/phase1_decoder")

### Phase 2 – Encoder-only training
In this phase, only the encoder part of the mBART model was fine-tuned while the decoder was frozen. This step allows the model to better understand the structure and semantics of the input text before learning how to generate summaries. By training the encoder separately, we aimed to obtain high-quality internal representations, which later benefit the decoder during summary generation.


#### Differences from Phase 1
* model = MBartForConditionalGeneration.from_pretrained("/content/phase1_decoder") - loads the model saved in Phase 1
* for name, param in model.model.decoder.named_parameters(): param.requires_grad = False - freezes the decoder
* input max_length = 896
* output max_length = 90
* learning rate = 0.0003


In [None]:
model = MBartForConditionalGeneration.from_pretrained(
    "/content/phase1_decoder",
    load_in_4bit=True,
    device_map="auto",
    use_cache=False,
    torch_dtype=torch.bfloat16
)

model.enable_input_require_grads()

for name, param in model.model.decoder.named_parameters():
    param.requires_grad = False

model = get_peft_model(model, lora_config)

# get_peft_model also adds trainable LoRA adapters to the decoder; freeze them so that only the encoder is trained
for name, param in model.named_parameters():
    if ".decoder." in name:
        param.requires_grad = False

def preprocess_phase2(example):
    inputs = tokenizer(example["text"], max_length=896, truncation=True, padding="max_length")
    labels = tokenizer(example["summary"], max_length=90, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized_datasets = dataset.map(preprocess_phase2, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="phase2_encoder",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=0.0003,
    num_train_epochs=2,
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=100,
    save_steps=1000,
    bf16=True,
    optim="adamw_bnb_8bit",
    push_to_hub=False
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()

model.save_pretrained("/content/phase2_encoder")
tokenizer.save_pretrained("/content/phase2_encoder")

### Phase 3 – Joint encoder-decoder fine-tuning
In this phase, we unfreeze both the encoder and decoder of the mBART model and fine-tune them jointly. This allows the model to learn deeper interactions between input texts and their target summaries. By training the full architecture end-to-end on the task-specific data, we enable it to align encoding and generation more effectively. This step consolidates the gains from previous phases and improves the overall summarization quality.

### Loading the model
* load_in_4bit=True - 4-bit weights reduce VRAM usage
* torch_dtype=torch.bfloat16 - 16-bit computation format reduces VRAM usage
* device_map="auto" - distributes the model across the available devices (GPU, CPU)
* use_cache=False - disables the key/value cache, which is not needed during training (reduces VRAM usage by ~10-20%)
* model.enable_input_require_grads() - makes the outputs of the input embedding layer require gradients even though the embeddings themselves are frozen
* model = get_peft_model(model, lora_config) - adds LoRA adapters to the model so that only a small number of parameters is trained

### Training parameters
**As previously**
* per_device_train_batch_size=1 - one example is processed at a time
* gradient_accumulation_steps=16 - gradients are accumulated over 16 examples before the weights are updated
**This gives an effective batch size of 16 (per_device_train_batch_size * gradient_accumulation_steps) while saving GPU RAM**
* predict_with_generate=True - generates text with the model during evaluation (validation/test)
* fp16=False/bf16=True - bfloat16 is numerically more stable than fp16
* adamw_bnb_8bit - AdamW optimizer from bitsandbytes that stores its optimizer states in 8-bit format while the update itself is computed in 32 bits (saves roughly 70% of the optimizer memory)
* data_collator=DataCollatorForSeq2Seq(tokenizer, model=model) - automatically pads the examples and prepares the batches for the model

In [None]:
model = MBartForConditionalGeneration.from_pretrained(
    "/content/phase2_encoder",
    load_in_4bit=True,
    device_map="auto",
    use_cache=False,
    torch_dtype=torch.bfloat16
)
model.enable_input_require_grads()

model = get_peft_model(model, lora_config)

# Same tokenization as in Phase 2
tokenized_datasets = dataset.map(preprocess_phase2, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="phase3_fullmodel",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=0.0001,
    num_train_epochs=3,
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=100,
    save_steps=1000,
    bf16=True,
    optim="adamw_bnb_8bit",
    push_to_hub=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()

model.save_pretrained("/content/phase3_fullmodel")
tokenizer.save_pretrained("/content/phase3_fullmodel")

* model.eval() - switches the model to evaluation mode (turns off dropout)
* predict_with_generate=True - predictions are produced with text generation instead of a plain forward pass
* do_train=False - the model weights remain unchanged
* do_eval=False - validation is not needed in this phase because the model is evaluated on the test set
* skip_special_tokens=True - omits special tokens such as PAD when decoding

### Metrics

### ROUGE - word and sequence overlap between generated and reference summaries
* use_stemmer=True - reduces words to their stems so that different forms of a word match

### BLEU - n-gram precision: how many words/phrases generated by the model match the references
### METEOR - word-level matching that also takes synonyms and word forms into account
### BERTScore - semantic similarity using contextual embeddings from a BERT-like model
* Precision - how well the tokens of the generated summary are semantically matched in the reference
* Recall - how well the tokens of the reference are semantically reflected in the generated summary
* F1 - balance between precision and recall
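
To make the idea of unigram overlap concrete, here is a simplified ROUGE-1 sketch. It uses whitespace tokenization and no stemming, so it is an illustration of the principle only; the `evaluate` implementation used in this notebook tokenizes and aggregates differently, and the `rouge1` helper and the English example sentences are made up for this sketch.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps, per word, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Five of the six words overlap in this example, so precision, recall and F1 are all 5/6.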

In [None]:
model.eval()
torch.cuda.empty_cache()

test_args = Seq2SeqTrainingArguments(
    output_dir="/content/eval-output",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    do_train=False,
    do_eval=False,
    fp16=False,
)

test_trainer = Seq2SeqTrainer(
    model=model,
    args=test_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)
)

test_results = test_trainer.predict(tokenized_datasets["test"])
decoded_preds = tokenizer.batch_decode(test_results.predictions, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(test_results.label_ids, skip_special_tokens=True)

df_test_output = pd.DataFrame({
    "Original": test_df["text"],
    "Reference Summary": test_df["summary"],
    "Generated Summary": decoded_preds
})
csv_path = "/content/summary_results.csv"
df_test_output.to_csv(csv_path, index=False)
print("CSV saved:", csv_path)

rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
print("\n ROUGE metrics:")
for k, v in rouge_scores.items():
    print(f"{k.upper()}: {v:.4f}")

bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])
print("\n BLEU SCORE:")
print(f"BLEU: {bleu_score['bleu']:.4f}")

meteor = evaluate.load("meteor")
meteor_score = meteor.compute(predictions=decoded_preds, references=decoded_labels)
print("\n METEOR SCORE:")
print(f"METEOR: {meteor_score['meteor']:.4f}")

bertscore = evaluate.load("bertscore")
bert_results = bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="lt", model_type="xlm-roberta-large")

avg_precision = sum(bert_results["precision"]) / len(bert_results["precision"])
avg_recall = sum(bert_results["recall"]) / len(bert_results["recall"])
avg_f1 = sum(bert_results["f1"]) / len(bert_results["f1"])

print("\n BERTScore (xlm-roberta-large):")
print(f"Precision: {avg_precision:.4f}")
print(f"Recall:    {avg_recall:.4f}")
print(f"F1 Score:  {avg_f1:.4f}")

from google.colab import files  # Colab-only helper for downloading files
files.download(csv_path)

### Phase 3 Results (Joint encoder-decoder training)

ROUGE-1 (0.3417): Measures overlap of unigrams (individual words) between generated summaries and references.

ROUGE-2 (0.1545): Measures overlap of bigrams, which reflects phrase-level agreement.

ROUGE-L (0.2614): Based on the longest common subsequence, which reflects how well word order is preserved.

BLEU (0.1106): Measures exact n-gram matching; a low score reflects limited word-for-word overlap rather than necessarily poor semantic quality.

METEOR (0.3032): Accounts for synonyms and stemming in addition to exact matches; the score reflects moderate agreement.

BERTScore F1 (0.9012): A high score indicates strong semantic similarity with the references.

### Phase 4 – Low-Rank Adaptation (LoRA) with quantization
In this phase, we apply Low-Rank Adaptation (LoRA) to the fine-tuned mBART model, significantly reducing the number of trainable parameters. LoRA inserts lightweight trainable layers into specific parts of the model, allowing efficient fine-tuning without updating the full model weights.

Simultaneously, we apply 4-bit quantization, which reduces memory usage and computational cost by storing weights in lower precision. This combination allows training to continue on limited hardware (Google Colab) while maintaining or even improving performance.

### Import modules
* import gc - gc.collect() frees unreferenced Python objects
* os.environ["PYTORCH_CUDA_ALLOC_CONF"] - configures the PyTorch CUDA memory allocator to manage GPU memory more efficiently during training
* expandable_segments:True - lets the allocator grow existing memory segments, which reduces fragmentation
* garbage_collection_threshold:0.8 - the allocator starts reclaiming unused cached blocks once GPU memory usage exceeds 80% of the available capacity

In [None]:
import gc
import torch
gc.collect()
torch.cuda.empty_cache()

### Training data
Loading the training and validation data

* model_checkpoint - model name on the Hugging Face Hub
* src_lang="lt_LT" - the input is in Lithuanian
* tgt_lang="lt_LT" - the output is in Lithuanian

**Using the fully trained model from Phase 3**

In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,garbage_collection_threshold:0.8"
torch.cuda.empty_cache()

train_df = pd.read_csv("/content/train.csv")
val_df = pd.read_csv("/content/validation.csv")
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
})

model_dir = "/content/phase3_fullmodel"
tokenizer = MBartTokenizer.from_pretrained(model_dir, src_lang="lt_LT", tgt_lang="lt_LT")

### LoRA training
### Tokenization
* Output length slightly increased to 96 tokens

### Loading the LoRA model
* config = PeftConfig.from_pretrained(model_dir) - loads the PEFT configuration of the previously saved LoRA adapter
* base_model = MBartForConditionalGeneration.from_pretrained - loads the original mBART model as the base for the LoRA adapters
* config.base_model_name_or_path - path of the original base model
* device_map="auto" - places the model on the available devices automatically
* model = PeftModel.from_pretrained - combines the base mBART model with the LoRA adapters (from model_dir)
* model.enable_input_require_grads() - lets gradients flow back through the inputs to the adapters
* if "lora" in name - loops over all model parameters and enables gradients only for those whose names contain "lora" (the adapter parameters); all other parameters remain frozen
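
The name-based selection of adapter parameters can be tried out without loading a model. The sketch below uses a stand-in `FakeParam` class and made-up parameter names shaped like those of a LoRA-wrapped model; both are hypothetical and serve only to show the selection rule.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for a tensor parameter; only the flag matters here."""
    requires_grad: bool = False


# Hypothetical parameter names, shaped like those a LoRA-wrapped model exposes.
params = {
    "model.encoder.layers.0.self_attn.q_proj.weight": FakeParam(),
    "model.encoder.layers.0.self_attn.q_proj.lora_A.default.weight": FakeParam(),
    "model.encoder.layers.0.self_attn.q_proj.lora_B.default.weight": FakeParam(),
    "model.decoder.layers.0.fc1.weight": FakeParam(),
}

# Same selection rule as in the training cell: adapter weights become trainable.
for name, param in params.items():
    if "lora" in name:
        param.requires_grad = True

trainable = [name for name, param in params.items() if param.requires_grad]
print(f"{len(trainable)} of {len(params)} parameters are trainable")
```

On the real model, `model.print_trainable_parameters()` reports the same information in terms of parameter counts.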

### Training parameters
**As previously**
* per_device_train_batch_size=1 - one example is processed at a time
* gradient_accumulation_steps=16 - gradients are accumulated over 16 examples before the weights are updated
**This gives an effective batch size of 16 (per_device_train_batch_size * gradient_accumulation_steps) while saving GPU RAM**
* predict_with_generate=True - generates text with the model during evaluation (validation/test)
* fp16=False/bf16=True - bfloat16 is numerically more stable than fp16
* adamw_bnb_8bit - AdamW optimizer from bitsandbytes that stores its optimizer states in 8-bit format while the update itself is computed in 32 bits (saves roughly 70% of the optimizer memory)

### Trainer
* trainer = Seq2SeqTrainer() - combines model, data, tokenizer and evaluation logic
* model=model - previously trained mBART model with LoRA adapters
* args=training_args - where to save model, number of epochs, batch size, optimizer
* train_dataset=tokenized_datasets["train"] - tokenized train dataset
* eval_dataset=tokenized_datasets["validation"] - tokenized validation dataset
* tokenizer=tokenizer - tokenizer used for padding and for decoding generated text; it is also saved together with the model
* data_collator=DataCollatorForSeq2Seq() - assembles batches by padding the examples into tensors of the same size (without it, texts of different lengths would cause errors)

In [None]:
def preprocess_function(examples):
    inputs = tokenizer(examples["text"], max_length=896, truncation=True, padding="max_length")
    labels = tokenizer(examples["summary"], max_length=96, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

config = PeftConfig.from_pretrained(model_dir)
base_model = MBartForConditionalGeneration.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, model_dir, torch_dtype=torch.bfloat16, device_map="auto")

model.enable_input_require_grads()
for name, param in model.named_parameters():
    if "lora" in name:
        param.requires_grad = True

model.print_trainable_parameters()

login(token="hf")

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/mbart-lt-phase4",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=0.0002,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_steps=100,
    eval_steps=500,
    save_steps=1000,
    fp16=False,
    bf16=True,
    optim="adamw_bnb_8bit",
    hub_model_id="Arnold001/mbart-lt-summary-phase4"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
)

trainer.train()

save_path = "/content/mbart-lt-phase4"
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

### Validation
* val_dataset = val_dataset.map(preprocess_function, batched=True) - tokenizes the validation texts and summaries in batches to prepare them in the model's input format
* PeftConfig.from_pretrained(model_dir) - loads the configuration from model_dir, which stores how the LoRA adapters were trained
* model = PeftModel.from_pretrained(base_model, model_dir, device_map="auto", torch_dtype=torch.bfloat16) - adds the LoRA adapters from model_dir to the base model. The weights of the base model remain frozen; only the adapter parameters differ from the base model

**device_map="auto" and torch_dtype=torch.bfloat16 ensure that the adapters are loaded in a format consistent with the base model**

* rouge = evaluate.load("rouge") - loads the ROUGE metric for assessing summary quality
* predictions, labels = eval_pred - unpacks the model predictions and labels, which are then decoded into text
* use_stemmer=True - compares word stems instead of exact word forms

* predict_with_generate=True - the model generates text (summaries) during evaluation
* bf16=True - uses 16-bit calculations to save VRAM
* do_train=False - disables training (evaluation only)

In [None]:
model_dir = "/content/mbart-lt-phase4"

val_df = pd.read_csv("/content/validation.csv")
val_dataset = Dataset.from_pandas(val_df)

tokenizer = MBartTokenizer.from_pretrained(model_dir, src_lang="lt_LT", tgt_lang="lt_LT")

def preprocess_function(examples):
    inputs = tokenizer(examples["text"], max_length=896, truncation=True, padding="max_length")
    labels = tokenizer(examples["summary"], max_length=96, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

val_dataset = val_dataset.map(preprocess_function, batched=True)

config = PeftConfig.from_pretrained(model_dir)
base_model = MBartForConditionalGeneration.from_pretrained(config.base_model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, model_dir, device_map="auto", torch_dtype=torch.bfloat16)

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

eval_args = Seq2SeqTrainingArguments(
    output_dir="/content/eval-output",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    do_train=False,
    do_eval=True,
    bf16=True,
    logging_dir=None
)

eval_trainer = Seq2SeqTrainer(
    model=model,
    args=eval_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
    compute_metrics=compute_metrics
)

val_results = eval_trainer.evaluate(eval_dataset=val_dataset)
print("Validation Results:")
print(val_results)

### Test

### Metrics

### ROUGE - word and sequence overlap between generated and reference summaries
* use_stemmer=True - reduces words to their stems so that different forms of a word match

### BLEU - n-gram precision: how many words/phrases generated by the model match the references
### METEOR - word-level matching that also takes synonyms and word forms into account
### BERTScore - semantic similarity using contextual embeddings from a BERT-like model
* Precision - how well the tokens of the generated summary are semantically matched in the reference
* Recall - how well the tokens of the reference are semantically reflected in the generated summary
* F1 - balance between precision and recall
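
The n-gram matching behind BLEU can be sketched with its basic building block, the clipped (modified) n-gram precision. Full BLEU combines these precisions for several n-gram orders with a brevity penalty, which this sketch leaves out; the helper names and the English example sentences are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(prediction: str, reference: str, n: int) -> float:
    """Share of predicted n-grams found in the reference, with clipped counts."""
    pred = Counter(ngrams(prediction.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(pred.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in pred.items())
    return clipped / total


pred_text = "the cat sat on the mat"
ref_text = "the cat lay on the mat"
p1 = modified_precision(pred_text, ref_text, 1)
p2 = modified_precision(pred_text, ref_text, 2)
print(f"unigram precision: {p1:.3f}")
print(f"bigram precision:  {p2:.3f}")
```

A single changed word lowers the bigram precision more than the unigram precision, which is one reason BLEU tends to be low for abstractive summaries.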

In [None]:
test_df = pd.read_csv("/content/test.csv")
test_dataset = Dataset.from_pandas(test_df)
test_dataset = test_dataset.map(preprocess_function, batched=True)

test_results = eval_trainer.predict(test_dataset)
decoded_preds = tokenizer.batch_decode(test_results.predictions, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(test_results.label_ids, skip_special_tokens=True)

df_test_output = pd.DataFrame({
    "Original": test_df["text"],
    "Reference Summary": test_df["summary"],
    "Generated Summary": decoded_preds
})
df_test_output.to_csv("summary_results.csv", index=False)
print("Summaries saved to summary_results.csv")

rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
print("\n ROUGE metrics:")
for k, v in rouge_scores.items():
    print(f"{k.upper()}: {v:.4f}")

bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])
print("\n BLEU SCORE:")
print(f"BLEU: {bleu_score['bleu']:.4f}")

meteor = evaluate.load("meteor")
meteor_score = meteor.compute(predictions=decoded_preds, references=decoded_labels)
print("\n METEOR SCORE:")
print(f"METEOR: {meteor_score['meteor']:.4f}")

bertscore = evaluate.load("bertscore")
bert_results = bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="lt", model_type="xlm-roberta-large")

avg_precision = sum(bert_results["precision"]) / len(bert_results["precision"])
avg_recall = sum(bert_results["recall"]) / len(bert_results["recall"])
avg_f1 = sum(bert_results["f1"]) / len(bert_results["f1"])

print("\n BERTScore:")
print(f"Precision: {avg_precision:.4f}")
print(f"Recall:    {avg_recall:.4f}")
print(f"F1 Score:  {avg_f1:.4f}")

### Phase 4 Results (LoRA training)

ROUGE-1 increased to 0.365 (+0.0233), indicating improved lexical precision.

ROUGE-2 increased to 0.1653 (+0.0108), reflecting better phrase-level coherence.

ROUGE-L increased to 0.2707 (+0.0093), demonstrating better fluency.

BLEU slightly increased to 0.1121 (+0.0015), minimal improvement in exact n-gram matching.

METEOR increased to 0.3101 (+0.0069), indicating enhanced semantic precision.

BERTScore F1 slightly improved to 0.9025 (+0.0013), maintaining high semantic accuracy.

Introducing LoRA allowed more efficient fine-tuning without sacrificing quality. All metrics improved, highlighting LoRA's effectiveness in capturing nuanced semantic and structural details.
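
The differences quoted above can be re-derived from the reported scores. The snippet below only uses the numbers stated in the Phase 3 and Phase 4 result sections; the relative change in percent is an extra column added for illustration.

```python
# Test-set scores as reported in the Phase 3 and Phase 4 result sections.
phase3 = {"ROUGE-1": 0.3417, "ROUGE-2": 0.1545, "ROUGE-L": 0.2614,
          "BLEU": 0.1106, "METEOR": 0.3032, "BERTScore F1": 0.9012}
phase4 = {"ROUGE-1": 0.3650, "ROUGE-2": 0.1653, "ROUGE-L": 0.2707,
          "BLEU": 0.1121, "METEOR": 0.3101, "BERTScore F1": 0.9025}

# Absolute change per metric, rounded to four decimals like the reported values.
deltas = {name: round(phase4[name] - phase3[name], 4) for name in phase3}

for name, delta in deltas.items():
    relative = 100 * delta / phase3[name]
    print(f"{name:<13} {phase3[name]:.4f} -> {phase4[name]:.4f} "
          f"({delta:+.4f}, {relative:+.1f}%)")
```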

### Phase 5 – Final LoRA refinement with longer input/output

In this final phase, we perform an additional round of LoRA-based fine-tuning, but this time we increase the input and output sequence lengths. This allows the model to better handle longer articles and generate more detailed summaries. We continue updating only the lightweight LoRA layers, which keeps training efficient while allowing the model to learn richer representations from longer texts.

This phase maximizes the model's summarization quality within resource constraints, completing the cascaded training process.

#### Differences from Phase 4
* input max_length increased from 896 to 1024
* output max_length increased from 96 to 128
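
Longer sequences cost more than their length alone suggests. As a rough guide, and assuming standard full self-attention whose cost grows with the square of the sequence length, the increase can be estimated as follows (the `growth` helper is hypothetical, and the quadratic figure covers only the attention computation, not the whole model):

```python
def growth(old: int, new: int) -> float:
    """Relative growth factor from the old to the new sequence length."""
    return new / old


input_growth = growth(896, 1024)
output_growth = growth(96, 128)

# Self-attention compares every token with every other token, so its cost
# grows roughly with the square of the sequence length.
encoder_attention_growth = input_growth ** 2

print(f"input length:  x{input_growth:.3f}")
print(f"output length: x{output_growth:.3f}")
print(f"encoder self-attention cost: about x{encoder_attention_growth:.3f}")
```

This is why the batch size stays at 1 with gradient accumulation in this phase: the per-example memory footprint is the largest of all phases.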

In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,garbage_collection_threshold:0.8"
torch.cuda.empty_cache()

train_df = pd.read_csv("/content/train.csv")
val_df = pd.read_csv("/content/validation.csv")
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
})

def preprocess_function(examples):
    inputs = tokenizer(examples["text"], max_length=1024, truncation=True, padding="max_length")
    labels = tokenizer(examples["summary"], max_length=128, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

model_dir = "/content/mbart-lt-phase4"  # directory the Phase 4 cell saved to
config = PeftConfig.from_pretrained(model_dir)

base_model = MBartForConditionalGeneration.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, model_dir, torch_dtype=torch.bfloat16, device_map="auto")

# Adapters are loaded frozen by default; re-enable gradients for the LoRA weights as in Phase 4
model.enable_input_require_grads()
for name, param in model.named_parameters():
    if "lora" in name:
        param.requires_grad = True

tokenizer = MBartTokenizer.from_pretrained(model_dir, src_lang="lt_LT", tgt_lang="lt_LT")

tokenized_datasets = dataset.map(preprocess_function, batched=True)

login(token="hf")

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/mbart-lt-phase5",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=0.0002,
    num_train_epochs=3,
    predict_with_generate=True,
    logging_steps=100,
    eval_steps=500,
    save_steps=1000,
    bf16=True,
    fp16=False,
    optim="adamw_bnb_8bit",
    hub_model_id="Arnold001/mbart-lt-summary-phase5"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
)

trainer.train()

save_path = "/content/mbart-lt-phase5"
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

In [None]:
model_dir = "/content/mbart-lt-phase5"

val_df = pd.read_csv("/content/validation.csv")
val_dataset = Dataset.from_pandas(val_df)

tokenizer = MBartTokenizer.from_pretrained(model_dir, src_lang="lt_LT", tgt_lang="lt_LT")

def preprocess_function(examples):
    # Same sequence lengths as in Phase 5 training
    inputs = tokenizer(examples["text"], max_length=1024, truncation=True, padding="max_length")
    labels = tokenizer(examples["summary"], max_length=128, truncation=True, padding="max_length")
    inputs["labels"] = labels["input_ids"]
    return inputs

val_dataset = val_dataset.map(preprocess_function, batched=True)

config = PeftConfig.from_pretrained(model_dir)
base_model = MBartForConditionalGeneration.from_pretrained(config.base_model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, model_dir, device_map="auto", torch_dtype=torch.bfloat16)

rouge = evaluate.load("rouge")

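import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Arrays gathered by the trainer may be padded with -100, which cannot be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)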

eval_args = Seq2SeqTrainingArguments(
    output_dir="/content/eval-output",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    do_train=False,
    do_eval=True,
    bf16=True,
    logging_dir=None
)

eval_trainer = Seq2SeqTrainer(
    model=model,
    args=eval_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model),
    compute_metrics=compute_metrics
)

val_results = eval_trainer.evaluate(eval_dataset=val_dataset)
print("Validation Results:")
print(val_results)
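
The preprocessing function pads labels to a fixed length with `padding="max_length"`, so the label arrays contain the pad-token ID. A common refinement is to replace those positions with -100, the label value that the cross-entropy loss ignores, and to restore the pad ID before decoding. The sketch below shows that round trip on toy token IDs. `PAD_ID = 1` and both helper names are assumptions for illustration and are not part of the notebook.

```python
PAD_ID = 1           # assumed pad-token ID; in practice read tokenizer.pad_token_id
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding(batch, pad_id=PAD_ID):
    """Replace pad IDs with IGNORE_INDEX so padded positions add nothing to the loss."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in seq] for seq in batch]

def unmask_padding(batch, pad_id=PAD_ID):
    """Restore pad IDs so the sequences can be passed to tokenizer.batch_decode."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in seq] for seq in batch]

# Toy label batch: the first sequence ends with two pad tokens
labels = [[250, 17, 2, 1, 1], [250, 9, 31, 44, 2]]
masked = mask_padding(labels)
restored = unmask_padding(masked)
print(masked)
print(restored == labels)
```

If labels are masked this way during preprocessing, the metric code has to apply the reverse step before decoding, because the tokenizer cannot map -100 back to a token.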

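### Test

### Metrics

### ROUGE - Word and word-sequence overlap between generated and reference summaries
* use_stemmer=True - reduces words to their stems before matching, so that different forms of a word count as a match (the stemmer in rouge_score is the English Porter stemmer, so its effect on Lithuanian text is likely limited)

### BLEU - N-gram precision: the share of generated words and phrases that also appear in the reference summaries
### METEOR - Word-level matching that also credits synonyms and word forms
### BERTScore - Semantic similarity computed from contextual embeddings (xlm-roberta-large in the code below)
* Precision - how much of the generated summary is semantically supported by the reference summary
* Recall - how much of the reference summary is semantically reflected in the generated summary
* F1 - harmonic mean of precision and recall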
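
To make the precision, recall, and F1 definitions concrete, here is a minimal sketch of ROUGE-N computed from n-gram overlap on whitespace tokens. It is an illustration only: the `rouge_n` helper is hypothetical and skips the tokenization and stemming that the `evaluate` implementation performs, so its numbers will not match the library's exactly.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) based on clipped n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

generated = "the cat sat on the mat"
reference = "the cat lay on the mat"
p1, r1, f1_1 = rouge_n(generated, reference, n=1)
p2, r2, f1_2 = rouge_n(generated, reference, n=2)
print(f"ROUGE-1: P={p1:.3f} R={r1:.3f} F1={f1_1:.3f}")
print(f"ROUGE-2: P={p2:.3f} R={r2:.3f} F1={f1_2:.3f}")
```

Precision divides the overlap by the number of n-grams in the generated text, while recall divides it by the number in the reference. This is why a very short summary can score high precision but low recall.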

In [None]:
test_df = pd.read_csv("/content/test.csv")
test_dataset = Dataset.from_pandas(test_df)
test_dataset = test_dataset.map(preprocess_function, batched=True)

test_results = eval_trainer.predict(test_dataset)
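import numpy as np  # used to replace -100 padding before decoding

# Arrays returned by predict() may be padded with -100, which cannot be decoded
pred_ids = np.where(test_results.predictions != -100, test_results.predictions, tokenizer.pad_token_id)
label_ids = np.where(test_results.label_ids != -100, test_results.label_ids, tokenizer.pad_token_id)
decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
decoded_labels = tokenizer.batch_decode(label_ids, skip_special_tokens=True)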

df_test_output = pd.DataFrame({
    "Original": test_df["text"],
    "Reference Summary": test_df["summary"],
    "Generated Summary": decoded_preds
})
df_test_output.to_csv("summary_results.csv", index=False)
print("📁 Summaries saved to summary_results.csv")

rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
print("\n ROUGE metrics:")
for k, v in rouge_scores.items():
    print(f"{k.upper()}: {v:.4f}")

bleu = evaluate.load("bleu")
bleu_score = bleu.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])
print("\n BLEU SCORE:")
print(f"BLEU: {bleu_score['bleu']:.4f}")

meteor = evaluate.load("meteor")
meteor_score = meteor.compute(predictions=decoded_preds, references=decoded_labels)
print("\n METEOR SCORE:")
print(f"METEOR: {meteor_score['meteor']:.4f}")

bertscore = evaluate.load("bertscore")
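# The summaries are Lithuanian, so lang is set to "lt"; the embeddings come from the explicit model_type
bert_results = bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="lt", model_type="xlm-roberta-large")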

avg_precision = sum(bert_results["precision"]) / len(bert_results["precision"])
avg_recall = sum(bert_results["recall"]) / len(bert_results["recall"])
avg_f1 = sum(bert_results["f1"]) / len(bert_results["f1"])

print("\n BERTScore:")
print(f"Precision: {avg_precision:.4f}")
print(f"Recall:    {avg_recall:.4f}")
print(f"F1 Score:  {avg_f1:.4f}")

### Phase 5 Results (Final LoRA with longer input/output)

ROUGE-1 rose slightly to 0.3655 (+0.0005), essentially unchanged.

ROUGE-2 rose to 0.168 (+0.0027), a small gain in matched two-word sequences (bigrams).

ROUGE-L rose to 0.2752 (+0.0045), a slight gain in longest-common-subsequence overlap, which suggests marginally better coherence.

BLEU fell slightly to 0.1096 (-0.0025), a negligible change.

METEOR rose further to 0.3116 (+0.0015), a small but consistent semantic gain.

BERTScore F1 rose marginally to 0.9027 (+0.0002), so semantic similarity remained high.

Expanding the input and output length gave the model richer context to work with. Semantic accuracy and coherence were maintained, with slight incremental gains on every metric except BLEU.
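
The changes reported above are easier to compare when expressed relative to the preceding scores. The sketch below back-computes those preceding values from the reported Phase 5 scores and deltas, assuming each delta is measured against the previous phase. Because the reported figures are rounded, the derived values are approximate. The `relative_change` helper is illustrative only.

```python
# (Phase 5 score, reported delta) copied from the results above
phase5 = {
    "ROUGE-1": (0.3655, +0.0005),
    "ROUGE-2": (0.168, +0.0027),
    "ROUGE-L": (0.2752, +0.0045),
    "BLEU": (0.1096, -0.0025),
    "METEOR": (0.3116, +0.0015),
    "BERTScore F1": (0.9027, +0.0002),
}

def relative_change(score, delta):
    """Return (previous score, percentage change) implied by a score and its delta."""
    previous = score - delta
    return previous, 100 * delta / previous

summary = {name: relative_change(score, delta) for name, (score, delta) in phase5.items()}
for name, (previous, pct) in summary.items():
    score = phase5[name][0]
    print(f"{name:<13} {previous:.4f} -> {score:.4f} ({pct:+.2f}%)")
```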