# Session 3: Fine-tuning LLMs for Low-Resource Languages 🚀

<div align="center">

**📚 Course Repository:** [github.com/NinaKivanani/Tutorials_low-resource-llm](https://github.com/NinaKivanani/Tutorials_low-resource-llm)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NinaKivanani/Tutorials_low-resource-llm/blob/main/3_tutorial.ipynb)
[![GitHub](https://img.shields.io/badge/GitHub-View%20Repository-blue?logo=github)](https://github.com/NinaKivanani/Tutorials_low-resource-llm)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

</div>

---

**Advanced Parameter-Efficient Fine-Tuning for Low-Resource Languages**

Welcome to **Session 3**! You'll master the art and science of adapting pretrained LLMs to specialized tasks using systematic fine-tuning techniques, with focus on practical applications for low-resource languages.

**🎯 Focus:** Parameter-efficient fine-tuning, LoRA, systematic evaluation  
**💻 Requirements:** GPU recommended (Colab free tier sufficient)  
**🔬 Methodology:** Production-ready techniques with systematic comparison

## Prerequisites

**📋 Recommended learning path:**
1. **Session 0:** Setup and tokenization analysis ✅  
2. **Session 1:** Systematic baseline techniques ✅
3. **Session 2:** Systematic prompt engineering ✅  
4. **This session (Session 3):** Advanced fine-tuning techniques ← You are here!

## What You Will Master

1. **🏗️ Fine-tuning fundamentals** - Full vs. parameter-efficient approaches with cost analysis
2. **⚡ LoRA and advanced PEFT** - Low-Rank Adaptation with systematic parameter optimization
3. **📊 Instruction tuning** - Task-specific adaptation with systematic evaluation  
4. **🎯 Preference optimization** - Alignment techniques for better outputs
5. **📈 Systematic monitoring** - Training metrics, loss analysis, convergence patterns
6. **🌍 Low-resource adaptation** - Strategies for data-scarce languages
7. **🏭 Production deployment** - Real-world considerations and best practices

## Learning Objectives

By the end of this session, you will:
- ✅ **Distinguish systematically** between full and parameter-efficient fine-tuning approaches
- ✅ **Implement LoRA fine-tuning** with sensible hyperparameter selection  
- ✅ **Monitor training systematically** using multiple metrics and visualizations
- ✅ **Evaluate model improvements** quantitatively across multiple dimensions
- ✅ **Design production pipelines** for low-resource language fine-tuning
- ✅ **Apply cost-benefit analysis** for real-world deployment decisions

## 🔬 Advanced Methodology

**This session uses production-grade practices:**
- **📊 Systematic Comparison:** Multiple fine-tuning approaches with quantitative evaluation
- **💰 Cost Analysis:** Resource requirements and ROI calculations for each approach
- **🎯 Task-Specific Evaluation:** Beyond perplexity - task-relevant metrics
- **🌍 Cross-Lingual Validation:** Systematic evaluation across language boundaries  
- **📈 Production Readiness:** Deployment considerations and scalability analysis

## How This Session Works

- **🎓 Theory → Practice → Analysis:** Learn concepts → Apply systematically → Measure results
- **🔧 Hands-on Implementation:** Real code, real models, real data
- **📊 Quantitative Evaluation:** Every claim backed by systematic measurement
- **💼 Production Focus:** Techniques you can use in real projects immediately
- **🌍 Low-Resource Emphasis:** Special attention to resource-constrained scenarios

**⚠️ Important Note:**  
This is a **production-oriented demonstration** using systematic methodology. While we use a small dataset for speed, all techniques scale to production systems. The focus is on **understanding systematic approaches** and **building production-ready intuitions**.


## 0. 🏗️ Fine-Tuning Fundamentals: Theory and Practice

### 0.1 Fine-Tuning Taxonomy: A Systematic Overview

**Fine-tuning** is the process of adapting a pretrained language model to specialized tasks or domains using additional labeled data. Understanding the landscape of approaches is crucial for making informed decisions.

| **Approach** | **Parameters Updated** | **Memory Requirement** | **Training Speed** | **Best For** | **Cost** |
|--------------|----------------------|----------------------|-------------------|--------------|----------|
| **🔥 Full Fine-tuning** | All parameters (100%) | Very High (roughly 4x the weights with Adam, before activations) | Slow | High-resource tasks | $$$$$ |
| **⚡ Parameter-Efficient (PEFT)** | Small subset (0.1-10%) | Low (frozen weights plus a small trainable part) | Fast | Low-resource languages | $$ |
| **🎯 LoRA** | Low-rank adapters (~1% or less) | Very Low | Very Fast | Most practical cases | $ |
| **📚 Instruction Tuning** | All weights or adapters, depending on method | Medium | Medium | Following instructions | $$$ |
| **🎪 Preference Optimization** | All weights or adapters, depending on method | Medium | Medium | Human alignment | $$$ |

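### 0.2 🔬 Deep Dive: Parameter-Efficient Fine-Tuning (PEFT)

**Why PEFT Matters for Low-Resource Languages:**

1. **💰 Cost Effectiveness:** Far fewer trainable parameters, so gradients and optimizer state shrink by orders of magnitude; total GPU memory usually drops by a small multiple, because the frozen weights and activations still have to fit
2. **⚡ Speed:** Less optimizer work per step and much smaller checkpoints; the actual speedup depends on model and hardware
3. **🛡️ Catastrophic Forgetting Prevention:** Frozen base weights help preserve original capabilities
4. **🔄 Task Switching:** Multiple adapters for different tasks on one base model
5. **📦 Storage Efficiency:** Adapters are typically megabytes, versus gigabytes for a full model checkpoint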
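
The memory argument above can be made concrete with a back-of-the-envelope estimate. The sketch below is only an approximation under stated assumptions: fp32 weights and gradients, Adam keeping two fp32 moments per trainable parameter, a model of about 1.1B parameters, and an adapter of about 1.1M parameters. Activation memory is ignored, and the helper name `training_memory_gb` is hypothetical.

```python
def training_memory_gb(total_params, trainable_params,
                       weight_bytes=4, grad_bytes=4, optimizer_bytes=8):
    """Rough memory for weights, gradients, and Adam state (activations excluded)."""
    weights = total_params * weight_bytes
    grads = trainable_params * grad_bytes          # gradients only for trainable params
    optimizer = trainable_params * optimizer_bytes  # assumed: two fp32 Adam moments
    return (weights + grads + optimizer) / 1e9

TOTAL_PARAMS = 1_100_000_000   # assumed: a ~1.1B-parameter model
ADAPTER_PARAMS = 1_100_000     # assumed: a ~1.1M-parameter LoRA adapter

full_ft_gb = training_memory_gb(TOTAL_PARAMS, TOTAL_PARAMS)
lora_gb = training_memory_gb(TOTAL_PARAMS, ADAPTER_PARAMS)

print(f"Full fine-tuning: ~{full_ft_gb:.1f} GB")
print(f"LoRA:             ~{lora_gb:.1f} GB")
print(f"Ratio:            ~{full_ft_gb / lora_gb:.1f}x")
```

Under these assumptions the trainable-parameter count differs by three orders of magnitude, while the memory footprint differs by roughly a factor of four, because the frozen weights dominate the LoRA case.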

### 0.3 🎯 LoRA (Low-Rank Adaptation) Deep Dive

**Mathematical Foundation:**
```
W = W₀ + ΔW = W₀ + BA
```
Where:
- `W₀`: Frozen pretrained weights (shape d × k)
- `B` (d × r), `A` (r × k): Trainable low-rank matrices with r << min(d, k)
- `ΔW = BA`: Learned update, whose rank is at most r

**Key Hyperparameters:**
- **Rank (r):** Higher = more expressive, but more trainable parameters and more room to overfit tiny datasets (typical: 4-64)
- **Alpha (α):** Scaling factor for adaptation strength; the update is scaled by α/r (typical: 16-32)
- **Target Modules:** Which layers to adapt (attention vs MLP vs both)
- **Dropout:** Regularization for adaptation layers (typical: 0.05-0.1)
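
To make the formula and hyperparameters tangible, here is a minimal sketch of a LoRA-wrapped linear layer in plain PyTorch. It is an illustration only, not the `peft` implementation: the class name `LoRALinear` is hypothetical, and the initialization (random `A`, zero `B`) follows the common convention that the adapter contributes nothing at the start of training.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W0 stays frozen
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: BA = 0 at start
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r                         # update is scaled by alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x @ A^T projects down to rank r, then @ B^T projects back up.
        update = self.dropout(x) @ self.A.T @ self.B.T
        return self.base(x) + self.scaling * update

torch.manual_seed(0)
base = nn.Linear(64, 32)
layer = LoRALinear(base, r=4, alpha=8)
layer.eval()                                             # disable dropout for the check

x = torch.randn(2, 64)
same_at_init = torch.allclose(layer(x), base(x))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("Output unchanged at init:", same_at_init)
print("Trainable parameters:", trainable)
```

Only `A` and `B` receive gradients, so the trainable count is `r * (in_features + out_features)` regardless of how large the frozen layer is.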

### 0.4 📊 Systematic Approach to Fine-Tuning

**Our methodology follows production best practices:**

1. **🧪 Baseline Establishment:** Test pretrained model performance
2. **📊 Systematic Hyperparameter Search:** Grid search over key parameters
3. **📈 Multi-Metric Evaluation:** Beyond perplexity - task-specific metrics
4. **🔍 Ablation Studies:** Understand what drives improvements
5. **💼 Production Planning:** Cost analysis and deployment considerations
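
As a small illustration of the grid-search step, the sketch below expands a search space into individual configurations. The value ranges and the helper name `expand_grid` are assumptions chosen for the example; each resulting dictionary would be passed to your own training and evaluation routine.

```python
from itertools import product

# Hypothetical search space; adjust the values to your budget.
SEARCH_SPACE = {
    "r": [4, 8, 16],
    "lora_alpha": [16, 32],
    "learning_rate": [1e-4, 2e-4],
}

def expand_grid(space):
    """Return one dict per combination of hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

configs = expand_grid(SEARCH_SPACE)
print(len(configs), "configurations")
print("First:", configs[0])
```

Because the number of runs grows multiplicatively, even this small grid needs twelve training runs; on a tiny dataset that is affordable, on a real one you would likely prune it.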


In [None]:
# If you are in Google Colab, make sure the runtime has a GPU:
# Runtime -> Change runtime type -> Hardware accelerator -> GPU

import torch

if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected. Training will be very slow.")


In [None]:
# Install required libraries.
# In Colab, this cell may take a couple of minutes.
!pip install -q transformers datasets accelerate peft

In [None]:
import os
import random
import math
from dataclasses import dataclass
from typing import Dict, List

import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

# For reproducibility
SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)
if device == "cuda":
    torch.cuda.manual_seed_all(SEED)


## 1. Load a compact chat model

We choose a relatively small chat-tuned model so that:

- It fits in Colab GPU memory.
- It has at least some exposure to languages other than English, although coverage of low-resource languages should be checked rather than assumed.

Here we use the TinyLlama chat model (about 1.1B parameters). Its pretraining data is predominantly English, so expect limited Luxembourgish ability out of the box.  
In a real project, the model choice would depend on license constraints, hardware, and language coverage.


In [None]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Some causal language models do not have a pad token set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

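# Load weights in float32. Mixed precision is handled later by `fp16=True` in
# TrainingArguments; training weights that were loaded in float16 under fp16 AMP
# can fail with "Attempting to unscale FP16 gradients" in some library versions.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
)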
model.to(device)

# Disable cache during training to avoid warnings.
model.config.use_cache = False

print("Model loaded.")

## 2. Build a tiny low-resource toy dataset

We construct a minimal dataset of English to Luxembourgish translation pairs directly in the notebook.  

- We treat Luxembourgish (lb) as the low-resource language.  
- In a real project, you would replace this list with real parallel data or task-specific instances.  
- The tiny size is intentional so that training finishes in a few minutes for demonstration purposes.
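
If you later swap the inline list for real parallel data, one simple option is a tab-separated file with one `source<TAB>target` pair per line. The sketch below is a hedged example of such a loader: the function name `load_parallel_tsv` and the file layout are assumptions, and the in-memory sample stands in for a real file.

```python
import csv
import io

def load_parallel_tsv(handle, language="lb"):
    """Read `source<TAB>target` rows into the record format used by the toy list."""
    records = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        source, target = (cell.strip() for cell in row)
        if source and target:
            records.append({
                "id": len(records) + 1,
                "language": language,
                "source": source,
                "target": target,
            })
    return records

# In-memory stand-in for a real file opened with open(path, encoding="utf-8", newline="").
sample = io.StringIO("Good morning.\tGudde Moien.\n\nbroken line\nThank you.\tMerci.\n")
pairs = load_parallel_tsv(sample)
print(pairs)
```

The resulting list can be passed to `Dataset.from_list` in the same way as the inline data. Opening files explicitly as UTF-8 matters here, since Luxembourgish text contains characters such as `ë` and `é`.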


In [None]:
toy_data = [
    {
        "id": 1,
        "language": "lb",
        "source": "Good morning, how are you?",
        "target": "Gudde Moien, wéi geet et dir?",
    },
    {
        "id": 2,
        "language": "lb",
        "source": "Thank you very much for your help.",
        "target": "Villmools Merci fir deng Hëllef.",
    },
    {
        "id": 3,
        "language": "lb",
        "source": "I would like a coffee with milk, please.",
        "target": "Ech hätt gär eng Taass Kaffi mat Mëllech, wann ech gelift.",
    },
    {
        "id": 4,
        "language": "lb",
        "source": "Where is the train station?",
        "target": "Wou ass d'Eisebunnsstatioun?",
    },
    {
        "id": 5,
        "language": "lb",
        "source": "Today the weather is very cold.",
        "target": "Haut ass d'Wieder ganz kal.",
    },
    {
        "id": 6,
        "language": "lb",
        "source": "My name is Anna and I live in Luxembourg.",
        "target": "Ech heeschen Anna an ech wunnen zu Lëtzebuerg.",
    },
    {
        "id": 7,
        "language": "lb",
        "source": "Could you please speak a little more slowly?",
        "target": "Kanns du w.e.g. e bësse méi lues schwätzen?",
    },
    {
        "id": 8,
        "language": "lb",
        "source": "I am learning Luxembourgish because I work here.",
        "target": "Ech léieren Lëtzebuergesch, well ech hei schaffen.",
    },
    {
        "id": 9,
        "language": "lb",
        "source": "The next bus arrives in ten minutes.",
        "target": "Den nächste Bus kënnt an zéng Minutten un.",
    },
    {
        "id": 10,
        "language": "lb",
        "source": "This food is delicious.",
        "target": "Dëst Iessen ass lecker.",
    },
    {
        "id": 11,
        "language": "lb",
        "source": "I do not understand, can you repeat that?",
        "target": "Ech verstinn net, kanns du dat widderhuelen?",
    },
    {
        "id": 12,
        "language": "lb",
        "source": "Have a nice evening.",
        "target": "Schéinen Owend nach.",
    },
]

dataset = Dataset.from_list(toy_data)
dataset

In [None]:
# Simple split: 75 percent train, 25 percent test.
split_dataset = dataset.train_test_split(test_size=0.25, seed=SEED)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

print("Train size:", len(train_dataset))
print("Eval size:", len(eval_dataset))

for example in eval_dataset:
    print(example)

## 3. Define an instruction-style prompt template

We wrap each example into a simple instruction prompt so that the model sees:

- A system-like description.
- The English sentence.
- A cue to produce the Luxembourgish translation.

For training, we construct a single text sequence that contains both the prompt and the target translation.  
The model learns to generate the full sequence.  
At inference time, we will provide only the prompt and ask the model to continue.


In [None]:
PROMPT_TEMPLATE = (
    "You are a helpful assistant that translates from English to Luxembourgish.\n"
    "Translate the following sentence into Luxembourgish.\n\n"
    "English: {source}\n"
    "Luxembourgish:"
)

def format_example(example: Dict) -> Dict:
    prompt = PROMPT_TEMPLATE.format(source=example["source"])
    full_text = prompt + " " + example["target"]
    return {
        "text": full_text,
        "language": example["language"],
        "id": example["id"],
    }

formatted_train = train_dataset.map(format_example)
formatted_eval = eval_dataset.map(format_example)

for e in formatted_train.select(range(2)):
    print("----")
    print(e["text"])

## 4. Baseline model behaviour before fine-tuning

Before we change any parameters, we check how the base TinyLlama model behaves on our evaluation set.

We will:

- Use only the prompt part of each example.
- Let the model generate a continuation.
- Compare the output qualitatively to the target translation.

Keep expectations realistic.  
The base model may already know some Luxembourgish, but it was not trained specifically for this task.


In [None]:
def build_prompt(source_sentence: str) -> str:
    return PROMPT_TEMPLATE.format(source=source_sentence)

def generate_translation(model, tokenizer, source_sentence: str, max_new_tokens: int = 64) -> str:
    prompt = build_prompt(source_sentence)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated

print("### Baseline outputs before fine tuning ###\n")

for example in eval_dataset:
    src = example["source"]
    tgt = example["target"]
    generated = generate_translation(model, tokenizer, src)
    print("English:", src)
    print("Target Luxembourgish:", tgt)
    print("Model output:")
    print(generated)
    print("=" * 60)

## 5. Prepare data for causal language model training

We now convert the formatted text examples into token ids suitable for causal language modeling.

- Each training instance is a sequence of tokens.
- The model will learn to predict the next token given previous tokens.
- For simplicity, we use the same token ids as both `input_ids` and `labels`.

In a more careful setup, you might mask the loss on prompt tokens and only train on the answer part.  
Here we keep the configuration simple so that the mechanics of parameter-efficient fine-tuning are clear.
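
For reference, prompt masking can be sketched as follows. The idea is to set the label of every prompt and padding position to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The function name `mask_prompt_labels` is hypothetical, the example assumes right-side padding, and `prompt_length` would come from tokenizing the prompt on its own.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default

def mask_prompt_labels(input_ids, attention_mask, prompt_length):
    """Copy input_ids to labels, ignoring prompt tokens and padding positions."""
    labels = []
    for position, (token_id, attended) in enumerate(zip(input_ids, attention_mask)):
        if position < prompt_length or not attended:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels

# Made-up token ids: 4 prompt tokens, 2 answer tokens, 2 padding tokens.
input_ids = [1, 11, 12, 13, 21, 22, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
labels = mask_prompt_labels(input_ids, attention_mask, prompt_length=4)
print(labels)
```

Note that `DataCollatorForLanguageModeling(mlm=False)` rebuilds `labels` from `input_ids`, so custom labels like these would need a collator that leaves them untouched, for example `default_data_collator` from `transformers`.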


In [None]:
MAX_SEQ_LENGTH = 256

def tokenize_function(example: Dict) -> Dict:
    result = tokenizer(
        example["text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
        padding="max_length",
    )
    # For simple language modeling we use the same ids as labels.
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_train = formatted_train.map(tokenize_function, remove_columns=["text", "language", "id"])
tokenized_eval = formatted_eval.map(tokenize_function, remove_columns=["text", "language", "id"])

print(tokenized_train[0])

In [None]:
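# Data collator for causal language modeling (mlm=False, so no masked language modeling).
# It rebuilds `labels` from `input_ids` and sets padding positions to -100,
# so padding is ignored by the loss.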
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

## 6. Configure LoRA parameter-efficient fine-tuning

Instead of updating all model parameters, we use LoRA:

- LoRA adds small trainable matrices (low-rank adapters) to selected linear layers.
- The base model weights stay frozen.
- This makes fine-tuning lighter and more feasible on modest hardware.
- It also reduces the risk of catastrophic forgetting.

We choose a small rank and apply LoRA to attention projection layers only.  
This is a typical starting point for LLaMA-like models.
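
Before running the configuration, it can help to estimate how many parameters the adapters will add. The sketch below does the arithmetic for rank 8 on `q_proj` and `v_proj`. The layer shapes are assumptions about TinyLlama-1.1B (hidden size 2048, 22 layers, and a 256-dimensional value projection due to grouped-query attention); check the result against the output of `print_trainable_parameters()`.

```python
def lora_param_count(in_features, out_features, r):
    """A is (r x in_features) and B is (out_features x r)."""
    return r * (in_features + out_features)

# Assumed TinyLlama-1.1B shapes; verify against model.config before relying on them.
HIDDEN_SIZE = 2048
NUM_LAYERS = 22
KV_DIM = 256      # assumed: 4 key/value heads x head dimension 64
RANK = 8

per_layer = (
    lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)   # q_proj
    + lora_param_count(HIDDEN_SIZE, KV_DIM, RANK)      # v_proj
)
total = per_layer * NUM_LAYERS

print("Adapter parameters per layer:", per_layer)
print("Adapter parameters in total:", total)
print(f"Approximate fp16 adapter size: {total * 2 / 1e6:.2f} MB")
```

Under these assumptions the adapters amount to roughly 0.1% of a 1.1B-parameter model, which is why saving only the adapter produces a file of a few megabytes.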


In [None]:
# Re-imported so this cell can run on its own (TaskType is used below).
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj"],  # typical for LLaMA family models
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

## 7. Training configuration

We set very conservative training hyperparameters:

- Small batch size.
- A few epochs over a tiny dataset.
- No checkpoint saving to keep the run light.
- Logging every step so that you can watch the loss.

In a realistic low-resource project you would:

- Use many more examples.
- Run for longer.
- Tune hyperparameters carefully.
- Monitor validation loss and task-specific metrics.
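
To see how small this run really is, the following sketch works out the number of optimizer steps from the values used in this notebook: 9 training examples (the 75% split of 12), batch size 2, 5 epochs, and a warmup ratio of 0.1. It assumes a single device and no gradient accumulation, and the helper name `schedule_summary` is hypothetical.

```python
import math

def schedule_summary(num_examples, batch_size, epochs, warmup_ratio):
    """Steps per epoch, total optimizer steps, and warmup steps (single device)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be smaller
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

steps_per_epoch, total_steps, warmup_steps = schedule_summary(
    num_examples=9, batch_size=2, epochs=5, warmup_ratio=0.1
)
print("Steps per epoch:", steps_per_epoch)
print("Total optimizer steps:", total_steps)
print("Warmup steps:", warmup_steps)
```

With only a couple of dozen updates in total, the loss curve will be noisy, and any conclusion about hyperparameters from this run should be treated as anecdotal.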


In [None]:
output_dir = "tiny_llama_lb_lora"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=5,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    logging_steps=1,
    # Recent transformers releases name this `eval_strategy`;
    # older releases used `evaluation_strategy`.
    eval_strategy="epoch",
    save_strategy="no",
    weight_decay=0.0,
    fp16=(device == "cuda"),
    report_to="none",
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=data_collator,
    # Recent transformers releases take the tokenizer as `processing_class`;
    # older releases used `tokenizer=tokenizer`.
    processing_class=tokenizer,
)

print("Trainer created.")

In [None]:
train_result = trainer.train()
print("\nTraining completed.")
print(train_result)

eval_metrics = trainer.evaluate()
print("\nEvaluation metrics:")
print(eval_metrics)

## 8. Compare outputs before and after fine-tuning

Now we generate translations again using the fine-tuned model.  
We keep the prompts identical and inspect:

- Whether the model is more likely to produce Luxembourgish.
- Whether the translations are closer to our target references.
- Any side effects such as overfitting to the tiny dataset style.
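
To put a rough number on "closer to the target", one option is a character n-gram F-score, which tends to be more forgiving than word-level matching for morphologically rich or inconsistently spelled languages. The sketch below is a deliberately simplified variant written for this notebook, not the reference chrF implementation; the function names and the choices `max_n=3` and `beta=2.0` are assumptions.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after collapsing runs of whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_fscore(hypothesis, reference, max_n=3, beta=2.0):
    """Average F-beta over character n-gram orders 1..max_n (simplified)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * precision * recall / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0

reference = "Haut ass d'Wieder ganz kal."
score_exact = char_fscore(reference, reference)
score_partial = char_fscore("Haut ass d'Wieder kal.", reference)
score_unrelated = char_fscore("abc", "xyz")
print(score_exact, round(score_partial, 3), score_unrelated)
```

When scoring model outputs, pass only the generated continuation, not the prompt. With three evaluation sentences the scores are still anecdotal, so read them alongside the raw outputs.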


In [None]:
print("### Outputs after LoRA fine tuning ###\n")

for example in eval_dataset:
    src = example["source"]
    tgt = example["target"]
    # Reuse the baseline helper so both runs share the same prompt
    # and the same sampling settings.
    generated = generate_translation(peft_model, tokenizer, src)

    print("English:", src)
    print("Target Luxembourgish:", tgt)
    print("Model output after fine tuning:")
    print(generated)
    print("=" * 60)

## 9. Quick discussion prompts

Discuss in small groups or write down short notes.

1. **Data size and quality.**  
   - We used 12 examples (9 for training, 3 for evaluation).  
   - What kinds of errors or biases can appear if we deploy a system trained on such a tiny sample?  
   - How would you scale the dataset for a real project in a low-resource setting?

2. **Evaluation.**  
   - We only looked at qualitative outputs and language modeling loss.  
   - Which task-specific metrics would you design for a real application such as translation, classification, or dialogue for a low-resource language?  
   - How would you build a reliable test set?

3. **Safety and robustness.**  
   - Fine-tuning can change model behaviour in unexpected ways.  
   - What additional checks would you perform before using a fine-tuned model with real users in a low-resource community?

4. **Transfer to your language of interest.**  
   - Suppose you want to adapt the same pipeline to Armenian or another language.  
   - What would you need to change in this notebook?  
   - Which parts are reusable, and which parts are specific to the Luxembourgish toy dataset?

5. **Beyond LoRA.**  
   - Parameter-efficient fine-tuning is one piece of the puzzle.  
   - What other techniques could you combine with LoRA for low-resource languages, for example prompting, retrieval-augmented generation, multilingual pretraining, or synthetic data generation?

Use these questions to connect the small-scale exercise with the broader methodological and ethical questions of building LLMs for low-resource languages.


## 10. Optional extensions (if you have time)

If you finish early, you can explore one of these directions:

1. **Add a second language.**  
   Extend the toy dataset with a few examples for another low-resource language, for example Armenian or Kurdish, and see how the model behaves.

2. **Loss masking.**  
   Modify the tokenization step so that the loss is computed only on the answer part of each example and not on the prompt.

3. **Temperature sweep.**  
   Generate outputs with different sampling temperatures and reflect on how diversity and correctness trade off.

4. **Save and reload adapters.**  
   Use `peft_model.save_pretrained` to save only the LoRA adapters, then reload them on top of the base model in a fresh notebook.

These small experiments help build intuition about how parameter-efficient fine-tuning interacts with low-resource data and multilingual models.
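
For the temperature sweep, it helps to see what temperature does to a next-token distribution before running any generation. The sketch below applies a temperature-scaled softmax to a made-up set of three logits; the values are hypothetical and the function name `softmax_with_temperature` is chosen for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                     # hypothetical next-token scores
sweep = {t: softmax_with_temperature(logits, t) for t in (0.3, 0.7, 1.5)}
for temperature, probs in sweep.items():
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, which tends to give more repeatable outputs; higher temperatures flatten the distribution and increase diversity, usually at some cost in correctness.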
