# Fine‑tuning **mT5‑base** with **LoRA** for Informal → Formal Style Transfer (Persian)

Name: `MohammadParsa Dini`
 
Student ID: `400101204`

Welcome! In this assignment, you’ll build an application that converts informal Persian sentences to formal ones.

You will:

1. **Pre‑process** the *ParsMap* informal–formal corpus with the `hazm` library.  
2. **Compute** input/output *token‑length statistics* to choose sensible `max_length` values.  
3. **Fine‑tune** the multilingual T5‑base model (`google/mt5-base`) using **Low‑Rank Adaptation (LoRA)**.  
4. **Evaluate** your model with BLEU and **perplexity**.  
5. **Explore** *stochastic decoding* strategies (temperature, top‑k, nucleus) and discuss diversity vs. quality.

Fill in each **`TODO`** region with code or text.  
When you finish, submit the completed notebook with a brief discussion section at the end summarising your findings.

### Key References  

| Topic | Paper |
|-------|------------------------------|
| Corpus | *Ehsani et al.* “Developing an Informal‑Formal Persian Corpus.” 🇮🇷 |
| Model | *Xue et al.* “mT5: A Massively Multilingual Pre‑trained Text‑to‑Text Transformer.” TACL 2021 |
| Fine‑tuning | *Hu et al.* “LoRA: Low‑Rank Adaptation of Large Language Models.” ICML 2022 |
| Decoding | *Holtzman et al.* “The Curious Case of Neural Text Degeneration.” ICLR 2020 |


## 1 · Environment & Dependencies  
Run the next cell **once** (commented by default) to install the dependencies.




In [1]:
pip install pandas==2.2.3 numpy==1.24.3 tqdm==4.67.1 hazm==0.10.0 datasets==3.1.0 transformers==4.46.3 peft==0.15.2 evaluate==0.4.3 accelerate==1.2.0 sacrebleu==1.5.1 jupyterlab==4.3.2

Collecting numpy==1.24.3
  Downloading numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting hazm==0.10.0
  Downloading hazm-0.10.0-py3-none-any.whl.metadata (11 kB)
Collecting datasets==3.1.0
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting transformers==4.46.3
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.1/44.1 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting peft==0.15.2
  Downloading peft-0.15.2-py3-none-any.whl.metadata (13 kB)
Collecting evaluate==0.4.3
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting accelerate==1.2.0
  Downloading accelerate-1.2.0-py3-none-any.whl.metadata (19 kB)
Collecting sacrebleu==1.5.1
  Downloading sacrebleu-1.5.1-py3-none-any.whl.metadata (1.3 kB)
Collecting jupyterlab==4.3.2
  Downloading jupyterlab-4.3.2-py3-none-any.whl.metadata (16 k

In [2]:
# 📦 Imports
import pandas as pd
import numpy as np
from tqdm import tqdm
from hazm import Normalizer
from datasets import Dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

def normalize_persian(text: str) -> str:
    # Remove ZWNJ (zero-width non-joiner) and similar control chars
    text = re.sub(r'[\u200c\u200e\u200f]', '', text)

    # Replace Arabic chars with Persian equivalents
    text = text.replace('ي', 'ی').replace('ك', 'ک')

    # Normalize numbers to Persian (optional)
    # text = text.translate(str.maketrans('0123456789', '۰۱۲۳۴۵۶۷۸۹'))

    # Remove diacritics (Tashdid, Fatha, etc.)
    text = ''.join([c for c in unicodedata.normalize('NFKD', text)
                    if not unicodedata.combining(c)])

    # Standardize punctuation spacing
    text = re.sub(r'\s+', ' ', text)                  # Normalize spaces
    text = re.sub(r'\s([.,؛؛،!?؟])', r'\1', text)      # Remove space before punct
    text = re.sub(r'([.,؛؛،!?؟])(?=\S)', r'\1 ', text) # Add space after punct
    text = text.strip()
    
    return text
print('imports done!')

2025-05-27 11:06:50.739951: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748344010.924347      19 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748344010.979244      19 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


imports done!


In [3]:
import pandas as pd
from datasets import Dataset, DatasetDict
from hazm import Normalizer

# Initialize the normalizer
normalizer = Normalizer()

# Set FILE_PATH
FILE_PATH = "/kaggle/input/test-l/ParsMap.xlsx"

# 1. Load the file
df = pd.read_excel(FILE_PATH)[['inFormalForm', 'formalForm']].rename(
    columns={'inFormalForm': 'input', 'formalForm': 'target'}
)

# Drop rows with missing values (optional but recommended)
df = df.dropna(subset=['input', 'target'])

# 2. Normalize (convert to str first to avoid errors)
df['input'] = df['input'].astype(str).apply(normalizer.normalize)
df['target'] = df['target'].astype(str).apply(normalizer.normalize)

# 3. Split to HF DatasetDict
full_ds = Dataset.from_pandas(df)
full_ds = full_ds.shuffle(seed=42)
split_ds = full_ds.train_test_split(test_size=0.10, seed=42)
val_test = split_ds['test'].train_test_split(test_size=0.50, seed=42)
dataset = DatasetDict({
    'train': split_ds['train'],
    'validation': val_test['train'],
    'test': val_test['test']
})

print(dataset)


DatasetDict({
    train: Dataset({
        features: ['input', 'target', '__index_level_0__'],
        num_rows: 45011
    })
    validation: Dataset({
        features: ['input', 'target', '__index_level_0__'],
        num_rows: 2501
    })
    test: Dataset({
        features: ['input', 'target', '__index_level_0__'],
        num_rows: 2501
    })
})


## 3 · Token‑length Statistics  
Before padding/truncation, inspect sequence lengths to decide `max_length` for **inputs** and **targets**.  
Write a helper `length_stats()` that returns *min, max, mean, 95‑percentile*.  


In [4]:
# TODO ↓
tokenizer = AutoTokenizer.from_pretrained('google/mt5-base', use_fast=False)

def length_stats(texts):
    """Return descriptive statistics over tokenised length."""
    # YOUR CODE HERE
    lengths = [len(tokenizer(text, truncation=False)['input_ids']) for text in texts]
    return {
        'min': int(np.min(lengths)),
        'max': int(np.max(lengths)),
        'mean': float(np.mean(lengths)),
        '95%': int(np.percentile(lengths, 95))
    }
    #raise NotImplementedError

input_stats  = length_stats(dataset['train']['input'])
target_stats = length_stats(dataset['train']['target'])

print('Input stats :', input_stats)
print('Target stats:', target_stats)

# Decide sensible values
MAX_SOURCE_LEN = input_stats['95%'] + 5  # small buffer  # TODO
MAX_TARGET_LEN = target_stats['95%'] + 5                 # TODO


tokenizer_config.json:   0%|          | 0.00/376 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/702 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Input stats : {'min': 3, 'max': 146, 'mean': 22.698562573593122, '95%': 45}
Target stats: {'min': 4, 'max': 150, 'mean': 24.663415609517674, '95%': 48}


### Tokenisation function  
Complete `preprocess_function` so that it returns `input_ids`, `attention_mask`, and `labels` truncated/padded to the lengths chosen above.

In [5]:
# TODO ↓
def preprocess_function(batch):
    # YOUR CODE HERE
    # Tokenize the inputs
    inputs = tokenizer(batch["input"], padding="max_length", truncation=True, max_length=MAX_SOURCE_LEN)
    
    # Tokenize the targets
    with tokenizer.as_target_tokenizer():
        targets = tokenizer(batch["target"], padding="max_length", truncation=True, max_length=MAX_TARGET_LEN)

    # Attach labels
    model_inputs = {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": targets["input_ids"]
    }
    return model_inputs

tokenised_ds = dataset.map(preprocess_function, batched=True, remove_columns=dataset['train'].column_names)
tokenised_ds

Map:   0%|          | 0/45011 [00:00<?, ? examples/s]



Map:   0%|          | 0/2501 [00:00<?, ? examples/s]

Map:   0%|          | 0/2501 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 45011
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2501
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2501
    })
})

## 4 · Model & LoRA Configuration  
Instantiate *mT5‑base* and wrap it with **LoRA**.  
Read the LoRA paper and, based on its insights and your available GPU resources, experiment with the *rank r*, `lora_alpha`, and target modules.”


In [6]:
# TODO ↓
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    # r=,                     # TODO: tune
    r=8, #lora size -- usually 2 , 4, 8
    # lora_alpha=,            # TODO: tune
    lora_alpha= 32, # scaling factor(ususally a mutiple of r)
    # target_modules=,        # TODO: tune
    target_modules = ["q", "v"], # lora in attention query & value projections
    lora_dropout=0.10,
    bias='none',
    task_type='SEQ_2_SEQ_LM'
)

base_model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-base')
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

pytorch_model.bin:   0%|          | 0.00/2.33G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

trainable params: 884,736 || all params: 583,286,016 || trainable%: 0.1517


## 5 · Fine‑tuning  
Define `Seq2SeqTrainingArguments` and train for **3 epochs**  
Log training loss and evaluate on the validation set each epoch.  


In [7]:
# TODO ↓
training_args = Seq2SeqTrainingArguments(
    # TODO
    output_dir="./mt5-persian-formalizer",         # directory to save model/checkpoints
    evaluation_strategy="epoch",                   # evaluate at end of each epoch
    learning_rate=5e-4,                            # usually higher with LoRA
    per_device_train_batch_size=16,                # adjust based on GPU RAM
    per_device_eval_batch_size=16,
    num_train_epochs=4,                            # adjust as needed
    weight_decay=0.01,                             # small weight decay
    save_total_limit=2,                            # limit checkpoints to save space
    save_strategy="epoch",                         # save model every epoch
    logging_dir="./logs",                          # for TensorBoard
    logging_strategy="steps",
    logging_steps=50,
    predict_with_generate=True,                    # necessary for seq2seq tasks
    generation_max_length=MAX_TARGET_LEN,          # for validation generation
    fp16=True,                                     # enable mixed precision if using a GPU that supports it
    report_to="none"
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding='longest')

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_ds['train'],
    eval_dataset=tokenised_ds['validation'],
    data_collator=data_collator
)

# 🚀 Train
trainer.train()  # ← uncomment when ready

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,1.0597,0.543511
2,0.5668,0.361606
3,0.4949,0.316421
4,0.4437,0.304661




TrainOutput(global_step=5628, training_loss=0.9977569666498506, metrics={'train_runtime': 6677.3519, 'train_samples_per_second': 26.963, 'train_steps_per_second': 0.843, 'total_flos': 2.11299223578624e+16, 'train_loss': 0.9977569666498506, 'epoch': 4.0})

## 6 · Inference  
Generate the *formal* version of **5 custom informal sentences** using **greedy decoding** *and* your `MAX_TARGET_LEN`.  


In [8]:
# 6 · Inference  
# Generate the formal version of 5 custom informal sentences using greedy decoding and MAX_TARGET_LEN
import torch

example_inputs = [
    "واسه چی اینقدر دیر اومدی؟",
    "برو اونور وایسا!",
    "خیلی باحالی داداش!",
    "نمیدونم چرا اینجوری شد.",
    "یه چیزی بپرسم؟"
]

# Preprocess inputs
inputs = tokenizer(
    example_inputs,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=MAX_SOURCE_LEN
).to(model.device)

# Generate outputs using greedy decoding
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=MAX_TARGET_LEN,
        do_sample=False  # greedy decoding
    )

# Decode predictions
formal_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

# Display results
for informal, formal in zip(example_inputs, formal_outputs):
    print(f"[INFORMAL] {informal}")
    print(f"[FORMAL  ] {formal}\n")


[INFORMAL] واسه چی اینقدر دیر اومدی؟
[FORMAL  ] برای چه این قدر دیر آمده ام؟

[INFORMAL] برو اونور وایسا!
[FORMAL  ] برو آنور وایسا!

[INFORMAL] خیلی باحالی داداش!
[FORMAL  ] خیلی باحالی داداش است.

[INFORMAL] نمیدونم چرا اینجوری شد.
[FORMAL  ] نمی دانم چرا این جوری شد.

[INFORMAL] یه چیزی بپرسم؟
[FORMAL  ] یک چیزی بپرسم؟



In [9]:
# # TODO ↓
# example_inputs = [
#     "واسه چی اینقدر دیر اومدی؟",
#     # add 4 more
# ]

# # Greedy decoding
# # YOUR CODE HERE


## 7 · Evaluation  
Compute **BLEU** on the *test* split and report **perplexity** on *validation*.  
Explain briefly what each metric captures for this task.  


In [10]:
import evaluate , math

import torch

# 1. Generate predictions on test set
test_inputs = tokenizer(
    dataset['test']['input'],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=MAX_SOURCE_LEN
).to(model.device)

with torch.no_grad():
    test_outputs = model.generate(
        input_ids=test_inputs["input_ids"],
        attention_mask=test_inputs["attention_mask"],
        max_length=MAX_TARGET_LEN,
        do_sample=False
    )

preds = tokenizer.batch_decode(test_outputs, skip_special_tokens=True)
refs = dataset['test']['target']

# 2. Compute BLEU
bleu = evaluate.load('sacrebleu')
bleu_score = bleu.compute(predictions=preds, references=[[r] for r in refs])
print(f"\n🔵 BLEU Score on Test Set: {bleu_score['score']:.2f}")

# 3. Compute perplexity on validation set
import torch.nn.functional as F

def compute_perplexity(model, dataset):
    model.eval()
    losses = []
    for batch in tqdm(DataLoader(dataset, batch_size=8)):
        input_ids = torch.tensor(batch['input_ids']).to(model.device)
        attention_mask = torch.tensor(batch['attention_mask']).to(model.device)
        labels = torch.tensor(batch['labels']).to(model.device)
        with torch.no_grad():
            output = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = output.loss
        losses.append(loss.item())
    mean_loss = np.mean(losses)
    return math.exp(mean_loss)

from torch.utils.data import DataLoader
val_perplexity = compute_perplexity(model, tokenised_ds['validation'])
print(f"🔴 Perplexity on Validation Set: {val_perplexity:.2f}")


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.33 GiB. GPU 0 has a total capacity of 14.74 GiB of which 2.01 GiB is free. Process 3358 has 12.73 GiB memory in use. Of the allocated memory 11.32 GiB is allocated by PyTorch, and 1.21 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
# # TODO ↓
# import evaluate, math

# bleu = evaluate.load('sacrebleu')
# # 1. Generate predictions
# # 2. Compute BLEU and perplexity

## 8 · Stochastic Decoding & Diversity Analysis  

Read *Holtzman et al. 2020* — *The Curious Case of Neural Text Degeneration* — to understand how different **stochastic decoding** strategies (like temperature, top‑k, and top‑p sampling) can lead to generating multiple diverse outputs from the same input prompt.

Implement these decoding strategies and experiment with several input examples to observe how the outputs vary.

In [None]:
# # TODO ↓
# def sample_outputs(prompt: str,
#                    num_return_sequences: int = 5,
#                    temperature: float = 0.7,
#                    top_k: int = 50,
#                    top_p: float = 1.0):
#     """Generate *num_return_sequences* diverse outputs from the fine‑tuned model."""
#     # YOUR CODE HERE

# prompt = "تو مطمئنی که بابا بلده گره دوتائی به کفشم بزنه وقتی که من صبحها میخوام برم مدرسه؟"
# samples = sample_outputs(prompt, num_return_sequences=5, temperature=0.9, top_p=0.95)
# print(*samples, sep='\n---\n')


def sample_outputs(prompt: str,
                   num_return_sequences: int = 5,
                   temperature: float = 0.7,
                   top_k: int = 50,
                   top_p: float = 1.0):
    """Generate *num_return_sequences* diverse outputs from the fine‑tuned model."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SOURCE_LEN).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=MAX_TARGET_LEN,
            do_sample=True,
            top_k=top_k,
            top_p=top_p,
            temperature=temperature,
            num_return_sequences=num_return_sequences,
        )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

prompt = "تو مطمئنی که بابا بلده گره دوتائی به کفشم بزنه وقتی که من صبحها میخوام برم مدرسه؟"
samples = sample_outputs(prompt, num_return_sequences=5, temperature=0.9, top_p=0.95)
print(*samples, sep='\n---\n')


## 9 · Discussion 

1. How did LoRA hyper‑parameters influence training stability or performance?  
2. **Deterministic vs. Stochastic Decoding**  
   Briefly explain what deterministic decoding (e.g. greedy search, beam search) and stochastic decoding (e.g. temperature sampling, top‑k/top‑p nucleus sampling) mean, drawing on Holtzman et al. 2020, *The Curious Case of Neural Text Degeneration*.
3. Suggest one improvement to the data or model that could further boost formalisation quality.  


---

### Submission Checklist ✅

- [ ] All `TODO` blocks completed.  
- [ ] Notebook runs end‑to‑end without errors (`Runtime ⇾ Restart & Run All`).  
- [ ] Answers written in the *Discussion* section.  

Good luck, and have fun experimenting! ✨
