# Fine‑tuning **mT5‑base** with **LoRA** for Informal → Formal Style Transfer (Persian)

Name: Seyyed Amirmahdi Sadrzadeh

Student ID: 401102015

Welcome! In this assignment, you’ll build an application that converts informal Persian sentences to formal ones.

You will:

1. **Pre‑process** the *ParsMap* informal–formal corpus with the `hazm` library.  
2. **Compute** input/output *token‑length statistics* to choose sensible `max_length` values.  
3. **Fine‑tune** the multilingual T5‑base model (`google/mt5-base`) using **Low‑Rank Adaptation (LoRA)**.  
4. **Evaluate** your model with BLEU and **perplexity**.  
5. **Explore** *stochastic decoding* strategies (temperature, top‑k, nucleus) and discuss diversity vs. quality.

Fill in each **`TODO`** region with code or text.  
When you finish, submit the completed notebook with a brief discussion section at the end summarising your findings.

### Key References  

| Topic | Paper |
|-------|------------------------------|
| Corpus | *Ehsani et al.* “Developing an Informal‑Formal Persian Corpus.” 🇮🇷 |
| Model | *Xue et al.* “mT5: A Massively Multilingual Pre‑trained Text‑to‑Text Transformer.” NAACL 2021 |
| Fine‑tuning | *Hu et al.* “LoRA: Low‑Rank Adaptation of Large Language Models.” ICLR 2022 |
| Decoding | *Holtzman et al.* “The Curious Case of Neural Text Degeneration.” ICLR 2020 |


## 1 · Environment & Dependencies  
Run the next cell **once** to install the pinned dependencies; skip it if they are already installed.

In [1]:
# 🛠️ TODO (⚠️ Uncomment the next line if you are in a fresh environment)
!pip install pandas==2.2.3 numpy==1.24.3 tqdm==4.67.1 hazm==0.10.0 datasets==3.1.0 transformers==4.46.3 peft==0.15.2 evaluate==0.4.3 accelerate==1.2.0 sacrebleu==1.5.1 jupyterlab==4.3.2


In [2]:
# 📦 Imports
import pandas as pd
import numpy as np
from tqdm import tqdm
from hazm import Normalizer
from datasets import Dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)
# TODO: add any other imports you need
import torch



## 2 · Data Loading & Normalisation  
Point `FILE_PATH` to the Excel file of the **ParsMap** dataset.
1. Keep only the *informal* and *formal* columns.  
2. Clean each sentence with `hazm.Normalizer`.  
3. Create `train`, `validation`, and `test` splits (90 / 5 / 5 %).  


In [3]:
# TODO ↓
FILE_PATH = "/kaggle/input/parsmap/ParsMap.xlsx"

# 1. Load the file
df = pd.read_excel(FILE_PATH)[['inFormalForm', 'formalForm']].rename(
    columns={'inFormalForm':'input', 'formalForm':'target'}
)

# 2. Normalise
normalizer = Normalizer()
df['input']  = df['input'].astype(str).apply(normalizer.normalize)
df['target'] = df['target'].astype(str).apply(normalizer.normalize)

# 3. Split to HF DatasetDict
full_ds = Dataset.from_pandas(df)
full_ds = full_ds.shuffle(seed=42)
split_ds = full_ds.train_test_split(test_size=0.10, seed=42)
val_test = split_ds['test'].train_test_split(test_size=0.50, seed=42)
dataset = DatasetDict({'train': split_ds['train'],
                       'validation': val_test['train'],
                       'test': val_test['test']})
dataset

DatasetDict({
    train: Dataset({
        features: ['input', 'target'],
        num_rows: 45012
    })
    validation: Dataset({
        features: ['input', 'target'],
        num_rows: 2501
    })
    test: Dataset({
        features: ['input', 'target'],
        num_rows: 2501
    })
})

## 3 · Token‑length Statistics  
Before padding/truncation, inspect sequence lengths to decide `max_length` for **inputs** and **targets**.  
Write a helper `length_stats()` that returns *min, max, mean, 95‑percentile*.  


In [4]:
# TODO ↓
tokenizer = AutoTokenizer.from_pretrained('google/mt5-base', use_fast=False)

def length_stats(texts):
    """Return mean, median, max, and quartiles of the tokenised lengths (in tokens)."""
    lengths = [len(tokenizer(text, truncation=False)['input_ids']) for text in texts]
    return {
        'mean': np.mean(lengths),
        'median': np.median(lengths),
        'max': np.max(lengths),
        '25%': np.percentile(lengths, 25),
        '75%': np.percentile(lengths, 75),
    }

input_stats  = length_stats(dataset['train']['input'])
target_stats = length_stats(dataset['train']['target'])

print('Input stats :', input_stats)
print('Target stats:', target_stats)

# Decide sensible values
MAX_SOURCE_LEN = 128  # TODO
MAX_TARGET_LEN = 128  # TODO



Input stats : {'mean': 22.689349506798187, 'median': 20.0, 'max': 146, '25%': 14.0, '75%': 28.0}
Target stats: {'mean': 24.65238158713232, 'median': 22.0, 'max': 150, '25%': 16.0, '75%': 30.0}
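
The task statement asks for the minimum, maximum, mean, and 95th percentile, while the helper above reports quartiles. A minimal sketch of the requested summary follows, written in plain Python so that it runs without the tokenizer. The names `percentile_nearest_rank`, `summarise_lengths`, and `suggest_max_length` are hypothetical, the nearest-rank percentile is only one of several common definitions, and rounding the cut-off up to a multiple of 8 is a convenience assumption, not a requirement.

```python
import math

def percentile_nearest_rank(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarise_lengths(lengths):
    """Return the four statistics named in the task statement."""
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "p95": percentile_nearest_rank(lengths, 95),
    }

def suggest_max_length(lengths, q=95, multiple=8):
    """Round the q-th percentile up to the next multiple (assumed convention)."""
    cutoff = percentile_nearest_rank(lengths, q)
    return math.ceil(cutoff / multiple) * multiple

# Toy token counts, not taken from ParsMap.
toy_lengths = [12, 14, 20, 22, 28, 30, 35, 41, 60, 146]
print(summarise_lengths(toy_lengths))
print(suggest_max_length(toy_lengths))
```

In the notebook, `lengths` would be the list built inside `length_stats`. The recorded statistics (75th percentiles of 28 and 30 tokens, maxima of 146 and 150) suggest that a limit of 128 truncates at most the extreme tail, and that a smaller limit would probably also work.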


### Tokenisation function  
Complete `preprocess_function` so that it returns `input_ids`, `attention_mask`, and `labels` truncated/padded to the lengths chosen above.

In [5]:
# TODO ↓
def preprocess_function(batch):
    model_inputs = tokenizer(
        batch['input'],
        truncation=True,
        padding='max_length',
        max_length=MAX_SOURCE_LEN
    )
    labels = tokenizer(
        batch['target'],
        truncation=True,
        padding='max_length',
        max_length=MAX_TARGET_LEN
    )['input_ids']
    model_inputs['labels'] = labels
    return model_inputs

tokenised_ds = dataset.map(preprocess_function, batched=True, remove_columns=dataset['train'].column_names)
tokenised_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 45012
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2501
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2501
    })
})
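
One detail worth checking: with `padding='max_length'`, the label sequences above keep the pad-token id in their padded positions, so those positions appear to be counted in the cross-entropy loss. The usual Hugging Face convention is to replace them with `-100`, which the loss ignores. The sketch below shows that replacement on plain lists. `mask_pad_labels` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption that should be read from `tokenizer.pad_token_id` in practice.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_pad_labels(batch_label_ids, pad_token_id):
    """Replace padding positions in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in batch_label_ids
    ]

# Two made-up label sequences padded to length 5 with pad id 0.
labels = [[259, 1542, 1, 0, 0], [259, 1, 0, 0, 0]]
print(mask_pad_labels(labels, pad_token_id=0))
```

In `preprocess_function`, this would be applied to `labels` before they are stored in `model_inputs`. Losses and perplexities obtained with and without the masking are not directly comparable, because averaging over easy-to-predict padding positions lowers the mean.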

## 4 · Model & LoRA Configuration  
Instantiate *mT5‑base* and wrap it with **LoRA**.  
Read the LoRA paper and, based on its insights and your available GPU resources, experiment with the rank `r`, `lora_alpha`, and the target modules.
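
As a reminder of what these hyper-parameters control: LoRA keeps the pretrained weight `W` frozen and learns a low-rank update, so an adapted projection computes `W x + (alpha / r) * B (A x)`. The toy sketch below uses plain Python lists and made-up 2×2 numbers. All names are illustrative, and this is not how PEFT implements the layer internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen projection plus scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # A is r x d_in, B is d_out x r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy identity)
A = [[0.5, 0.5]]               # rank r = 1, d_in = 2
B_zero = [[0.0], [0.0]]        # LoRA starts with B = 0, so the update is zero
B_trained = [[1.0], [-1.0]]    # made-up values standing in for a trained B
x = [2.0, 4.0]

print(lora_forward(W, A, B_zero, x, alpha=4, r=1))     # equals W x
print(lora_forward(W, A, B_trained, x, alpha=4, r=1))  # W x plus scaled update
```

Because `B` starts at zero, the adapted model initially reproduces the pretrained one, and `alpha / r` decides how strongly the learned update is weighted.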


In [6]:
# TODO ↓
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                     # rank
    lora_alpha=32,           # scaling
    target_modules=["q", "v"],  # inject into query & value projections
    lora_dropout=0.10,
    bias='none',
    task_type='SEQ_2_SEQ_LM'
)


base_model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-base')
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

trainable params: 884,736 || all params: 583,286,016 || trainable%: 0.1517
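
The reported count can be reproduced by hand: each adapted projection adds `r * (d_in + d_out)` parameters. The sketch assumes mT5-base's published configuration (model width 768, 12 encoder and 12 decoder layers, 768-wide query and value projections). These values are not printed in the notebook and should be checked against `base_model.config`; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, d_in, d_out, n_matrices):
    """Parameters added by LoRA: one (r x d_in) and one (d_out x r) matrix per adapted weight."""
    return n_matrices * r * (d_in + d_out)

# Assumed mT5-base configuration values.
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
D_MODEL = 768

# Each encoder layer has self-attention; each decoder layer has self- and cross-attention.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_matrices = attention_blocks * 2  # "q" and "v" in every attention block

trainable = lora_param_count(r=8, d_in=D_MODEL, d_out=D_MODEL, n_matrices=adapted_matrices)
total = 583_286_016  # "all params" from the output above

print(trainable)                          # 884736
print(round(100 * trainable / total, 4))  # 0.1517
```

Doubling `r` doubles the adapter size, which still leaves it well below 1% of the model.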


## 5 · Fine‑tuning  
Define `Seq2SeqTrainingArguments` and train for **3 epochs**.  
Log training loss and evaluate on the validation set each epoch.  


In [9]:
# TODO ↓
training_args = Seq2SeqTrainingArguments(
    output_dir="./mt5-persian-formalizer",
    evaluation_strategy="epoch",
    learning_rate=4e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    save_strategy="epoch",
    logging_dir="./logs",
    logging_strategy="steps",
    logging_steps=50,
    predict_with_generate=True,
    generation_max_length=MAX_TARGET_LEN,
    fp16=True,
    report_to="none"
)


data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding='longest')

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_ds['train'],
    eval_dataset=tokenised_ds['validation'],
    data_collator=data_collator
)

# 🚀 Train
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.2578 | 0.157878 |
| 2 | 0.202 | 0.132445 |
| 3 | 0.1838 | 0.122864 |




TrainOutput(global_step=16881, training_loss=0.4402227787869525, metrics={'train_runtime': 15541.02, 'train_samples_per_second': 8.689, 'train_steps_per_second': 1.086, 'total_flos': 4.057035225012634e+16, 'train_loss': 0.4402227787869525, 'epoch': 3.0})
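
The logged `global_step` can be sanity-checked against the dataset size. With 45,012 training rows, 3 epochs, and a per-device batch of 4, a single device would need 33,759 steps. The logged 16,881 steps instead match an effective batch of 8, as would be the case with two devices; the ratio of samples per second to steps per second (8.689 / 1.086) points the same way. The device count is an inference from these numbers, not something the notebook states, and `total_optimizer_steps` is a hypothetical helper that ignores edge cases in how the trainer handles gradient accumulation.

```python
import math

def total_optimizer_steps(n_examples, per_device_batch, n_devices, epochs, grad_accum=1):
    """Estimate optimizer steps when the last partial batch of each epoch is kept."""
    effective_batch = per_device_batch * n_devices * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return steps_per_epoch * epochs

print(total_optimizer_steps(45012, per_device_batch=4, n_devices=1, epochs=3))  # 33759
print(total_optimizer_steps(45012, per_device_batch=4, n_devices=2, epochs=3))  # 16881
```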

## 6 · Inference  
Generate the *formal* version of **5 custom informal sentences** using **greedy decoding** *and* your `MAX_TARGET_LEN`.  


In [14]:
# TODO ↓
example_inputs = [
    "واسه چی اینقدر دیر اومدی؟",
    "من امروز نتونستم سر وقت برسم، متأسفم.",
    "این کار رو چطوری باید انجام بدم؟",
    "می‌خوای با هم بریم کافی‌شاپ؟",
    "دیروز فیلم جدیدو دیدی؟"
]

# Greedy decoding
for inp in example_inputs:
    # tokenize
    inputs = tokenizer(
        inp,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=MAX_SOURCE_LEN
    )
    # move each tensor to the model’s device (e.g. cuda:0)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # now generate
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask"),
        max_length=MAX_TARGET_LEN
    )
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"IN : {inp}\nOUT: {decoded}\n")

IN : واسه چی اینقدر دیر اومدی؟
OUT: برای چه این قدر دیر آمده ای؟

IN : من امروز نتونستم سر وقت برسم، متأسفم.
OUT: من امروز نتوانم سر وقت برسم، متأسف هستم.

IN : این کار رو چطوری باید انجام بدم؟
OUT: این کار را چطوری باید انجام بدهم؟

IN : می‌خوای با هم بریم کافی‌شاپ؟
OUT: می خواهی با هم به کافی شاپ برویم.

IN : دیروز فیلم جدیدو دیدی؟
OUT: دیروز فیلم جدید را دیدی.



## 7 · Evaluation  
Compute **BLEU** on the *test* split and report **perplexity** on *validation*.  
Explain briefly what each metric captures for this task.  


In [15]:
# TODO ↓
import evaluate, math

bleu = evaluate.load('sacrebleu')

# 1. Generate predictions
preds, refs = [], []
for example in dataset['test']:
    inp, tgt = example['input'], example['target']
    encoded = tokenizer(
        inp,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=MAX_SOURCE_LEN
    )
    encoded = {k: v.to(model.device) for k, v in encoded.items()}
    out = model.generate(**encoded, max_length=MAX_TARGET_LEN)
    pred = tokenizer.decode(out[0], skip_special_tokens=True)
    preds.append(pred)
    refs.append([tgt])

bleu_score = bleu.compute(predictions=preds, references=refs)
print(f"Test BLEU: {bleu_score['score']:.2f}")

# 2. Compute perplexity
eval_results = trainer.evaluate(tokenised_ds['test'])
perplexity = math.exp(eval_results['eval_loss'])
print(f"Test Perplexity: {perplexity:.2f}")


Test BLEU: 38.52




Test Perplexity: 1.13


### Evaluation Metrics

**BLEU Score**  
BLEU (Bilingual Evaluation Understudy) measures the n-gram overlap between your model’s generated “formal” sentences and the ground-truth formal references. In this style-transfer task, a higher BLEU indicates that the model’s rephrasings closely match human-written formal versions in terms of word choice, phrase structure, and overall lexical fidelity.
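
To make "n-gram overlap" concrete, the sketch below computes the clipped (modified) n-gram precision at the heart of BLEU for one candidate and one reference. It is a simplified illustration with whitespace tokenisation and English toy strings. It omits the brevity penalty, the geometric mean over n = 1 to 4, and sacreBLEU's own tokenisation, so its numbers are not comparable to the score reported above; `clipped_ngram_precision` is a hypothetical helper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped to the reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

candidate = "the the the cat"
reference = "the cat sat"
print(clipped_ngram_precision(candidate, reference, 1))  # 0.5
print(clipped_ngram_precision(candidate, reference, 2))
```

Clipping is what stops the repeated "the" from being rewarded three times: only as many occurrences as the reference contains are counted.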

**Perplexity**  
Perplexity is computed as the exponential of the model’s cross-entropy loss on the test set. It captures how “surprised” the model is, on average, when predicting each next token. Lower perplexity means the model has learned the formal style’s probability distribution well and finds the generation task more predictable, reflecting stronger overall language modeling of the target register.
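
The relationship between loss and perplexity can be sketched directly. The helper below (a hypothetical name) takes per-token natural-log probabilities of the reference tokens and returns the exponential of the mean negative log-likelihood, which is what `math.exp(eval_loss)` computes when the loss is a token-level mean. The second call plugs in the epoch-3 validation loss from the training log.

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """exp(mean negative log-likelihood) over the reference tokens."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that gives every reference token probability 0.5 has perplexity of about 2.
print(perplexity_from_logprobs([math.log(0.5)] * 4))

# Epoch-3 validation loss from the training log.
print(round(math.exp(0.122864), 2))  # 1.13
```

A perplexity of 1 would mean every reference token receives probability 1, while a uniform guess over a vocabulary of size V gives perplexity V.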


## 8 · Stochastic Decoding & Diversity Analysis  

Read *Holtzman et al. 2020* — *The Curious Case of Neural Text Degeneration* — to understand how different **stochastic decoding** strategies (like temperature, top‑k, and top‑p sampling) can lead to generating multiple diverse outputs from the same input prompt.

Implement these decoding strategies and experiment with several input examples to observe how the outputs vary.
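
Before relying on `generate`, it can help to see what the three knobs do to a single next-token distribution. The following is a minimal sketch in plain Python over a toy four-token vocabulary. The function names are hypothetical, and applying temperature first, then top-k, then top-p is one common ordering assumed here, not a statement about any library's internals.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Keep the top_k most probable tokens, then the smallest prefix reaching top_p; renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_token(filtered, rng):
    """Draw one token id from the filtered, renormalised distribution."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

toy_probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_k_top_p_filter(toy_probs, top_p=0.75)  # keeps token ids 0 and 1
print(nucleus)
print(sample_token(nucleus, random.Random(0)))
```

With `top_p=1.0` and `top_k=0` the filter keeps everything, which corresponds to sampling from the full distribution.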

In [17]:
# TODO ↓
def sample_outputs(
    prompt: str,
    num_return_sequences: int = 5,
    temperature: float = 0.7,
    top_k: int = 50,
    top_p: float = 1.0
):
    """Generate diverse outputs from the fine-tuned model."""
    # Tokenize
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=MAX_SOURCE_LEN
    )
    # Move inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Sampling generation
    outputs = model.generate(
        **inputs,
        max_length=MAX_TARGET_LEN,
        do_sample=True,
        num_return_sequences=num_return_sequences,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
    )

    # Decode and return
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

prompt = "تو مطمئنی که بابا بلده گره دوتائی به کفشم بزنه وقتی که من صبحها میخوام برم مدرسه؟"
samples = sample_outputs(prompt, num_return_sequences=5, temperature=0.9, top_p=0.95)
print(*samples, sep='\n---\n')


تو مطمئن هستی که بابا بلده گره دوتایی به کفشم بزن وقتی که من صبح ها می خواهم به مدرسه برم؟
---
تو مطمئن هستی که بابا بلده گره دوتایی به کفشم بزند وقتی که من صبح ها می خواهم به مدرسه برم؟
---
تو مطمئن هستی که بابا بلده گره دوتایی به کفش هم بزند وقتی که من صبح ها می خواهم به مدرسه برم؟
---
تو مطمئن هستی که بابا بلده گره دوتائی به کفش هم بزند وقتی که من صبح ها می خواهم برویم، مدرسه؟
---
تو مطمئنی که بابا بلده گره دوتایی به کفش هم بزن وقتی که من صبح ها می خواهم برم مدرسه؟
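
The five samples above differ in only a few words. One simple way to quantify that, offered here as an assumption and not as a metric the assignment prescribes, is distinct-n: the ratio of unique n-grams to all n-grams across a set of samples. The sketch uses whitespace tokenisation and English toy strings; `distinct_n` is a hypothetical helper.

```python
def distinct_n(samples, n):
    """Unique n-grams divided by total n-grams over all samples (0.0 if there are none)."""
    all_ngrams = []
    for text in samples:
        tokens = text.split()
        all_ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)

toy_samples = ["a b c", "a b d"]
print(distinct_n(toy_samples, 1))
print(distinct_n(toy_samples, 2))  # 0.75
```

Calling `distinct_n(samples, 2)` on the list returned by `sample_outputs` for several temperature and `top_p` settings would give a rough diversity curve to set against BLEU.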


## 9 · Discussion 

1. How did LoRA hyper‑parameters influence training stability or performance?  
2. **Deterministic vs. Stochastic Decoding**  
   Briefly explain what deterministic decoding (e.g. greedy search, beam search) and stochastic decoding (e.g. temperature sampling, top‑k/top‑p nucleus sampling) mean, drawing on Holtzman et al. 2020, *The Curious Case of Neural Text Degeneration*.
3. Suggest one improvement to the data or model that could further boost formalisation quality.  


### Answers

1. **LoRA Hyper-Parameter Influence**  
   - **Rank (r)**: Sets the size of each adapter and therefore the number of trainable parameters. Only `r=8` was trained in this notebook (about 0.15% of all parameters), and it gave a steadily decreasing validation loss over three epochs. Following the LoRA paper, lower ranks would be expected to save memory at some cost in capacity, and higher ranks to add trainable parameters with diminishing returns; neither was measured here.  
   - **Alpha (`lora_alpha`)**: Scales the adapter update by `lora_alpha / r`, here 32 / 8 = 4. A larger scale acts much like a larger learning rate on the adapter, which speeds adaptation but can destabilise training if pushed too far. The pre-trained weights themselves stay frozen.  
   - **Dropout**: Dropout of 0.1 on the adapter inputs regularises fine-tuning, which is intended to limit overfitting on the roughly 45k-pair ParsMap training split.  
   - **Target modules**: Injecting LoRA only into the query and value projections follows the setup used in the LoRA paper and keeps the trainable parameter count small compared with adapting every linear layer.

2. **Deterministic vs. Stochastic Decoding**  
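   - **Deterministic decoding** (greedy search, beam search) maximises likelihood: greedy search picks the highest-probability token at each step, and beam search keeps the few highest-scoring partial sequences. The output is reproducible and high-likelihood, but Holtzman et al. (2020) show that for open-ended generation such maximisation tends to produce bland, repetitive text.  
   - **Stochastic decoding** (temperature sampling, top-k sampling, top-p or nucleus sampling) draws each token from the model's probability distribution, which adds randomness and diversity. Temperature rescales the logits before the softmax (values below 1 sharpen the distribution, values above 1 flatten it), top-k keeps only the k most probable tokens, and top-p keeps the smallest set of tokens whose cumulative probability reaches p. In *The Curious Case of Neural Text Degeneration*, Holtzman et al. argue that sampling from the full distribution yields incoherent text because the low-probability tail is unreliable, and that nucleus sampling balances coherence and diversity by truncating that tail adaptively. Formalisation is a tightly constrained task, which is consistent with the samples above differing in only a few words.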

3. **Suggested Improvement**  
   **Back-translation data augmentation**: Train a reverse (formal → informal) model and use it to turn formal sentences, ideally drawn from formal text outside ParsMap, into synthetic informal variants. Adding these synthetic informal–formal pairs to the real data enlarges and diversifies the parallel corpus, which should help the model learn more robust formality mappings and reduce overfitting.  


---

### Submission Checklist ✅

- [ ] All `TODO` blocks completed.  
- [ ] Notebook runs end‑to‑end without errors (`Runtime ⇾ Restart & Run All`).  
- [ ] Answers written in the *Discussion* section.  

Good luck, and have fun experimenting! ✨
