## Fine-tune Minerva3B-base 
To fine-tune Minerva3B-base we make use of the **unsloth** library. 
**NOTE**: The notebook was prepared to run on Google Colab. 

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4

In [None]:
# Standard library
import json
import os
import re

# Third-party (unsloth is imported before transformers and trl)
from unsloth import FastLanguageModel
import pandas as pd
import torch
from datasets import Dataset, load_from_disk
from google.colab import drive
from huggingface_hub import HfApi, login
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig

Mount Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

To download `Minerva3B-base` you need to generate a Hugging Face access token and log in. Here the token is read from the Colab secret named `HF_TOKEN`.

In [None]:
from google.colab import userdata
from huggingface_hub import login
hf_token = userdata.get('HF_TOKEN')
login(token=hf_token)

#### Unsloth

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "sapienzanlp/Minerva-3B-base-v1.0",
    max_seq_length = 1024,
    load_in_4bit = True,    # ← saves memory
    load_in_8bit = False,
    full_finetuning = False,   # we'll perform partial fine-tuning using PEFT
)
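
As a rough back-of-the-envelope check of why `load_in_4bit = True` saves memory, the sketch below estimates the size of the weights alone. The 3-billion parameter count is an assumption taken from the "3B" in the model name, and the estimate ignores quantization metadata, activations and optimizer state, so the real footprint will differ.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 3e9  # assumption: nominal size suggested by the name "3B"

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```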

#### PEFT
**PEFT** (Parameter-Efficient Fine-Tuning) is a set of techniques for fine-tuning large language models efficiently. The idea behind PEFT methods is to update only a subset of the parameters instead of all of the model's parameters, which in many situations would be prohibitively expensive.

**Low-Rank Adaptation (LoRA)** is one of the most popular PEFT methods. It freezes the original model's pretrained weights and trains two small matrices (called **update matrices**) that adapt the model to the new data while keeping the number of trainable parameters low.

In practice this means that we update only a small number of parameters (1% to 10%). Now we are going to add the LoRA adapters.
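
To make the saving concrete, here is a minimal sketch that counts the parameters of one linear layer against the two LoRA update matrices attached to it. The layer size (2048 x 2048) is a hypothetical value chosen for illustration, not Minerva's actual dimension; only `r = 16` matches the configuration used in this notebook.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (frozen weight parameters, trainable LoRA parameters) for one linear layer."""
    full = d_in * d_out        # the frozen pretrained weight matrix
    lora = r * (d_in + d_out)  # update matrices A (r x d_in) and B (d_out x r)
    return full, lora

# Hypothetical layer dimensions, for illustration only
full, lora = lora_param_counts(d_in=2048, d_out=2048, r=16)

print(f"frozen weight:        {full:,} parameters")
print(f"LoRA update matrices: {lora:,} parameters ({lora / full:.2%} of the layer)")
```

The ratio depends on the rank `r` and on the layer shapes, so the fraction for the whole model depends on which modules are targeted.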

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

### Data Prep

Next, we load the dataset and format it with the Alpaca prompt template:
```
"""Di seguito viene fornita un'istruzione che descrive uno specifico task, seguita dall'input dello user. Scrivi una risposta che completa la richiesta.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

```
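
As an illustration, the sketch below fills the template with a single sample. The OCR/clean pair is invented for this example, and `"</s>"` is only a stand-in for `tokenizer.eos_token`, which the notebook reads from the real tokenizer.

```python
template = """Di seguito viene fornita un'istruzione che descrive uno specifico task, seguita dall'input dello user. Scrivi una risposta che completa la richiesta.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

instruction = "Correggi il testo ocr fornito in input dall'utente"
ocr = "Il gatto e sul tavo1o"    # hypothetical noisy OCR text
clean = "Il gatto è sul tavolo"  # hypothetical corrected text
EOS = "</s>"                     # stand-in for tokenizer.eos_token

# The three placeholders are filled in order: instruction, input, response
training_text = template.format(instruction, ocr, clean) + EOS
print(training_text)
```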

In [1]:
from datasets import load_from_disk

In [2]:
train_dataset = load_from_disk("../datasets/t5-datasets/train")
test_dataset = load_from_disk("../datasets/t5-datasets/test")

In [4]:
print(train_dataset)
print(test_dataset)

Dataset({
    features: ['ocr', 'clean'],
    num_rows: 1804
})
Dataset({
    features: ['ocr', 'clean'],
    num_rows: 244
})


In [None]:
alpaca_prompt = """Di seguito viene fornita un'istruzione che descrive uno specifico task, seguita dall'input dello user. Scrivi una risposta che completa la richiesta.

### Instruction:
{}

### Input:
{}

### Response:
{}"""



EOS = tokenizer.eos_token   # end-of-sequence token; it must be appended so the model learns when to stop generating

def prepare_ocr_dataset(examples):
    instruction = "Correggi il testo ocr fornito in input dall'utente"
    ocr_samples = examples["ocr"]
    clean_samples = examples['clean']

    texts = []
    for ocr, clean in zip(ocr_samples, clean_samples):
        text = alpaca_prompt.format(instruction, ocr, clean) + EOS

        texts.append(text)

    return {"text" : texts, }

In [None]:
formatted_train_dataset = train_dataset.map(prepare_ocr_dataset, batched=True)

### Train the model

In [None]:
trainer_ocr = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    dataset_text_field = "text",
    eval_dataset = None,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 20,  # ~ 10-20% of the total steps
        #num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 200,
        learning_rate = 5e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "wandb", # report training stats to Weights & Biases (wandb)

    ),
)
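
The arithmetic behind these settings can be checked with a short sketch. It assumes training runs on a single GPU (as on a standard Colab runtime); the number of training rows comes from the dataset printed earlier.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 200
warmup_steps = 20
num_train_rows = 1804  # from the printed train_dataset
num_devices = 1        # assumption: a single GPU

# Gradients are accumulated over several small batches before each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = effective_batch_size * max_steps
epochs = samples_seen / num_train_rows
warmup_fraction = warmup_steps / max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen in {max_steps} steps: {samples_seen} (~{epochs:.2f} epochs)")
print(f"warmup fraction: {warmup_fraction:.0%}")
```

Under these assumptions, 200 steps cover slightly less than one pass over the training set; uncommenting `num_train_epochs = 1` (and removing `max_steps`) would train for exactly one epoch instead.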

In [None]:
trainer_stats = trainer_ocr.train()

### Inference

For inference we format the samples with the Alpaca template as before, but leave the **Response** field blank so that the model generates the answer:
```
"""Di seguito viene fornita un'istruzione che descrive uno specifico task, seguita dall'input dello user. Scrivi una risposta che completa la richiesta.

### Instruction:
{}

### Input:
{}

### Response:
"""
```
Then we use a simple regular expression to retrieve the *input* and the model's response from the decoded output.
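
The extraction step can be tried in isolation on plain strings. The sketch below uses a variant of the pattern that lets `.` span newlines (`re.DOTALL`) and also accepts a response cut off before the end-of-sequence token. The helper name `parse_answer`, the sample strings and the `"</s>"` EOS value are assumptions made for this example.

```python
import re

EOS = "</s>"  # assumed EOS string; in the notebook it comes from tokenizer.eos_token

pattern = re.compile(
    r"### Instruction:\s*(.*?)\s*### Input:\s*(.*?)\s*### Response:\s*(.*?)\s*(?:"
    + re.escape(EOS) + r"|\Z)",
    re.DOTALL,  # let "." match newlines so multi-line samples are captured
)

def parse_answer(decoded: str):
    """Return the input and the generated response, or None if the text does not match."""
    m = pattern.search(decoded)
    if m is None:
        return None
    return {"in": m.group(2).strip(), "hyp": m.group(3).strip()}

# Hypothetical decoded outputs: one ending with EOS, one truncated by max_new_tokens
complete = ("<s>### Instruction:\nCorreggi\n\n### Input:\nl' uorno\ncammina\n\n"
            "### Response:\nl'uomo\ncammina</s>")
truncated = "<s>### Instruction:\nCorreggi\n\n### Input:\nciao rnondo\n\n### Response:\nciao mon"

print(parse_answer(complete))
print(parse_answer(truncated))
```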

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
ocr_samples = test_dataset['ocr']
clean_samples = test_dataset['clean']

# re.DOTALL lets "." span newlines in multi-line samples. The response ends at the
# EOS token, or at the end of the text if generation stopped at max_new_tokens.
pattern = re.compile(
    r"### Instruction:\s*(.*?)\s*### Input:\s*(.*?)\s*### Response:\s*(.*?)\s*(?:"
    + re.escape(tokenizer.eos_token) + r"|\Z)",
    re.DOTALL,
)
instruction = "Correggi il testo ocr fornito in input dall'utente"

results = []
for i, (ocr, clean) in enumerate(zip(ocr_samples, clean_samples)):
    inputs = tokenizer([
        alpaca_prompt.format(
            instruction,
            ocr,
            "", # we leave this blank for generation
        )
    ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
    answer = tokenizer.batch_decode(outputs)

    match = pattern.search(answer[0])
    if match:
        d = {
            'in': match.group(2).strip(),   # OCR text given to the model
            'hyp': match.group(3).strip(),  # model's correction
            'ref': clean,                   # reference correction
        }
        results.append(d)

        # Print the first few results as a sanity check
        if i < 4:
            print(d)

Save the model's answers.

In [None]:
with open("../results/minerva3B-base/minerva3B-base.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

### Push to Hugging Face