# How to train an LLM on your own data! - Tutorial by Giovanni Corigliano
Goal: show how to fine-tune a private, open-source LLM on your own data and then export it to GGUF for Ollama.

Tutorial by **Giovanni Corigliano**: [LinkedIn](https://www.linkedin.com/in/giovanni-corigliano-9819b7225/)


Follow me and leave a like!


This notebook shows how to:

- Install Unsloth and its dependencies.

- Load a Llama 3.2 model from Unsloth's repository on the Hugging Face Hub.

- Apply LoRA adapters for efficient fine-tuning.

- Quickly test inference with chat templates.

- Prepare a conversation dataset from a CSV file.

- Configure and launch an SFT trainer on a 🤗 Datasets dataset.

- Optimize the model for inference and export it in GGUF format.



**Note**: to use the free GPU, go to Edit -> Notebook settings -> T4 GPU.

# Installing the libraries

In [1]:
%%capture
#%%capture: Suppresses shell command output to keep your notebook clean.
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Unsloth
Unsloth simplifies training models such as Llama 3 locally or on platforms like Google Colab and Kaggle. It streamlines the entire training workflow: loading, quantization, training, evaluation, running, saving, exporting, and integration with inference engines such as Ollama, llama.cpp, and vLLM.

Fine-tuning an LLM specializes its behavior, improves or extends its knowledge, and optimizes its performance on specific domains or tasks.

Examples:

- Training an LLM to predict whether a headline has a positive or negative impact on a company.

- Using the history of customer interactions to give more accurate, personalized responses.

- Fine-tuning an LLM on legal texts for contract analysis, case-law research, and compliance.

You can think of a fine-tuned model as a specialized agent designed to carry out specific tasks more effectively and efficiently. Fine-tuning can replicate all the capabilities of retrieval-augmented generation (RAG), but not vice versa.

https://unsloth.ai/

https://docs.unsloth.ai/

# Initializing the model

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Choose True to use 4bit quantization to reduce memory usage. False is 16bit full precision.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.10: Fast Llama patching. Transformers: 4.52.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## LoRA
**LoRA** (Low-Rank Adaptation) is a fine-tuning technique that:
- **Does not modify** the original weights of the base model.
- **Adds** small low-rank adaptation matrices $\Delta W = A B$ to selected layers.
- **Trains** only these matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$, with $r \ll d$, drastically reducing the number of parameters to update.

Mathematically, for a linear layer with weight $W \in \mathbb{R}^{d \times d}$:
$$
W_{\text{eff}} \;=\; W \;+\; \underbrace{A B}_{\Delta W},
\quad A \in \mathbb{R}^{d \times r}, \; B \in \mathbb{R}^{r \times d}.
$$
With a small $r$ (e.g. 16), the number of trainable parameters is $2 d r$, i.e. $\mathcal{O}(d r)$, instead of $d^2$.
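The formula $W_{\text{eff}} = W + A B$ can be sketched in a few lines of plain Python. This is only an illustration on a toy $3 \times 3$ layer: the helper names (`matmul`, `lora_effective_weight`, `lora_param_counts`) are hypothetical, and $d = 3072$ in the parameter count is an assumed layer width, not a value taken from this notebook.

```python
# Minimal LoRA sketch on a tiny d x d layer (illustrative sizes, not the real model).

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B):
    """Return W_eff = W + A B, leaving the frozen W untouched."""
    delta = matmul(A, B)  # d x d matrix of rank at most r
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

def lora_param_counts(d, r):
    """Trainable parameters: full fine-tuning (d^2) versus LoRA (2 d r)."""
    return d * d, 2 * d * r

# Toy example with d = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weight
A = [[1.0], [2.0], [3.0]]                                  # d x r, trainable
B = [[0.5, 0.0, -0.5]]                                     # r x d, trainable
W_eff = lora_effective_weight(W, A, B)

# Assumed width d = 3072 with r = 16: 9,437,184 versus 98,304 parameters.
full, lora = lora_param_counts(d=3072, r=16)
print(full, lora)
```

Only `A` and `B` would receive gradients during training; `W` stays exactly as loaded.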



## PEFT (Parameter-Efficient Fine-Tuning)

PEFT adds LoRA adaptation matrices only to the specified modules.
This way only a tiny fraction of the parameters is updated, saving memory and time.

## PEFT parameters

- r: the rank, i.e. the inner dimension of the matrices $A$ and $B$.

- target_modules: list of the layer names where LoRA is applied (typically the self-attention and FFN projections).

- lora_alpha: the internal scaling factor; the matrix $\Delta W$ is multiplied by $\frac{\alpha}{r}$.

- lora_dropout: dropout applied to the input of $A$, useful for regularization.

- bias="none": do not adapt the bias terms.

- use_gradient_checkpointing="unsloth": enables Unsloth's own gradient checkpointing to handle longer contexts with less VRAM.

- random_state: seed for the initialization of $A$ and $B$.

- use_rslora: if enabled, uses a rank-stabilized version of LoRA.

- loftq_config: optional configuration for LoftQ.


https://huggingface.co/docs/diffusers/training/lora
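As a rough check of how few parameters these settings train, the sketch below counts the LoRA parameters for r = 16 on the seven target modules. For a rectangular layer, $A$ is in_features × r and $B$ is r × out_features, so each module adds r × (in_features + out_features) parameters. The layer dimensions are assumptions taken from the public Llama-3.2-3B configuration rather than from this notebook; with them, the total matches the 24,313,856 trainable parameters reported in the training log further down.

```python
# Assumed dimensions of Llama-3.2-3B (from its public config, not from this notebook).
HIDDEN = 3072         # hidden_size
INTERMEDIATE = 8192   # intermediate_size of the MLP
KV_DIM = 1024         # num_key_value_heads (8) * head_dim (128)
NUM_LAYERS = 28       # consistent with "patched 28 layers" in the Unsloth log

# (in_features, out_features) of each target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """Parameters added by one adapter: A (in x r) plus B (r x out)."""
    return r * (in_features + out_features)

def total_trainable(r):
    """LoRA parameters summed over all target modules and all layers."""
    per_layer = sum(lora_params(i, o, r) for i, o in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

def lora_scale(lora_alpha, r):
    """Factor applied to Delta W, as described for lora_alpha."""
    return lora_alpha / r

print(total_trainable(r=16))          # 24313856
print(lora_scale(lora_alpha=16, r=16))  # 1.0
```

With lora_alpha equal to r, as in the cell below, the update is applied with a scale of 1.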


In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)


Unsloth 2025.5.10 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [4]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "flux prompt for a steampunk laboratory?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Son Goku is the main protagonist in the popular Japanese manga and anime series "Dragon Ball," created by Akira Toriyama. Goku is a powerful warrior with extraordinary abilities, known for his incredible strength, agility, and endurance.

Born on Earth, Goku was discovered by Grand Elder Guru, a powerful being from a higher planet, who sensed that he was destined for greatness. Goku's parents, Bardock and Gine, were low-ranking members of the Namekian resistance, and his early life was marked by hardship and tragedy.

As a child, Goku demonstrated exceptional strength and martial arts skills, which caught the attention of King Piccolo


# Formatting the text for the Llama 3.2 format

In [5]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

In [7]:
import pandas as pd
df = pd.read_csv("/content/FinetuningLLM/yourdataset.csv")

In [8]:
df.head()

Unnamed: 0,User,Prompt
0,"""flux prompt for a Viking warrior with a black...","""Norse amulet made from black iron with a carb..."
1,flux prompt for a spooky spirit summoning scen...,1920s double-exposure spirit photograph of a b...
2,flux prompt for a steampunk children's car wit...,A blueprint of Steampunk style mini Children's...
3,"""flux prompt for a girl wearing roller skates""","""A cute girl below the knee, sneakers"""
4,flux prompt for a retro-futuristic spaceship w...,Spaceship retro futurism raygun Gothic style s...


In [9]:
df = df.dropna()

In [10]:
from datasets import Dataset

# Now let's convert the DataFrame into a HuggingFace Dataset, removing the old columns

df["conversations"] = df.apply(
    lambda x: [
        {"content": x["User"], "role": "user"},
        {"content": x["Prompt"], "role": "assistant"}
    ], axis=1
)

dataset = Dataset.from_pandas(df.drop(columns=["User", "Prompt"]))

In [11]:
dataset

Dataset({
    features: ['conversations', '__index_level_0__'],
    num_rows: 287
})

In [12]:
dataset['conversations'][0]

[{'content': '"flux prompt for a Viking warrior with a black iron amulet"',
  'role': 'user'},
 {'content': '"Norse amulet made from black iron with a carbon fiber necklace, matte painting, Unreal Engine, --ar 16:9"',
  'role': 'assistant'}]
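The same row-to-conversation mapping can be sketched without pandas, which may help when checking a CSV outside the notebook. This is a minimal sketch using only the standard library; `SAMPLE_CSV` is a hypothetical two-row file with the same User and Prompt columns, and `rows_to_conversations` is an illustrative helper, not part of Unsloth or 🤗 Datasets.

```python
import csv
import io

# Hypothetical CSV with the same "User" / "Prompt" columns as the notebook's dataset.
SAMPLE_CSV = """User,Prompt
flux prompt for a girl wearing roller skates,"A cute girl below the knee, sneakers"
flux prompt with no answer,
"""

def rows_to_conversations(csv_text):
    """Turn each complete CSV row into a [user, assistant] message pair."""
    conversations = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        user, prompt = row.get("User"), row.get("Prompt")
        if not user or not prompt:
            # Same intent as df.dropna(): skip rows with a missing field.
            continue
        conversations.append([
            {"content": user, "role": "user"},
            {"content": prompt, "role": "assistant"},
        ])
    return conversations

conversations = rows_to_conversations(SAMPLE_CSV)
print(len(conversations))
print(conversations[0])
```

The second row is dropped because its Prompt field is empty, so only one conversation remains.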

In [13]:
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

# typical formatting for fine tuning

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/287 [00:00<?, ? examples/s]

Map:   0%|          | 0/287 [00:00<?, ? examples/s]

In [14]:
dataset[5]["conversations"]

[{'content': '"flux prompt for a Melbourne tram in Avatar movie style with blue and green color palette"',
  'role': 'user'},
 {'content': '"[Melbourne\'s Flinders Street] by [Avatar Movie] art style::20 bubbles::10 aquarium scene::10 lights::10 One Melbourne Tram::15 blue and green color palette::10 ultra-detail, wide-angle, sharp look, high detail, epic lighting, vivid light refractions, photorealistic, ultra-realistic, photo-realistic, fit in the screen, rule of thirds::8 —ar 16:9"',
  'role': 'assistant'}]

In [15]:
dataset[5]["text"]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"flux prompt for a Melbourne tram in Avatar movie style with blue and green color palette"<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"[Melbourne\'s Flinders Street] by [Avatar Movie] art style::20 bubbles::10 aquarium scene::10 lights::10 One Melbourne Tram::15 blue and green color palette::10 ultra-detail, wide-angle, sharp look, high detail, epic lighting, vivid light refractions, photorealistic, ultra-realistic, photo-realistic, fit in the screen, rule of thirds::8 —ar 16:9"<|eot_id|>'
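To make the structure of that string easier to see, here is a hand-written approximation of what the chat template produces, reconstructed only from the output above. The function `render_llama3_chat` and the constant `DEFAULT_SYSTEM` are hypothetical names; the real template lives in the tokenizer and may handle details (such as whitespace trimming or custom system messages) that this sketch ignores.

```python
# Approximation of the "llama-3.1" chat template, reconstructed from dataset[5]["text"].
DEFAULT_SYSTEM = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"

def render_llama3_chat(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} messages into one string."""

    def turn(role, content):
        # Each turn is a role header, a blank line, the content, then an end-of-turn tag.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    text = "<|begin_of_text|>" + turn("system", DEFAULT_SYSTEM)
    for message in messages:
        text += turn(message["role"], message["content"])
    if add_generation_prompt:
        # For inference: open an assistant turn and let the model complete it.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

example = [
    {"role": "user", "content": "flux prompt for a steampunk laboratory"},
    {"role": "assistant", "content": "Steampunk laboratory, copper pipes"},
]
print(render_llama3_chat(example))
```

During training the full conversation is rendered (add_generation_prompt = False); at inference time the assistant header is appended so the model writes the reply.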

# Training the model

Main arguments passed to SFTTrainer:

- model = model
Must already be a model prepared for fine-tuning: wrapped with LoRA adapters if you are using LoRA, as in this case, or a plain Hugging Face model if you are not.

- tokenizer = tokenizer
The tokenizer that matches the model: it converts text into token IDs and back. It must be the same one used at inference time and be compatible with the model checkpoint.

- train_dataset = dataset
The 🤗 Datasets object that contains your examples. This snippet assumes dataset is already a Dataset with a "text" column (see the dataset_text_field parameter next).

- dataset_text_field = "text"
Tells SFTTrainer which field (column) of your dataset contains the text to use. If dataset were a Hugging Face object with, for example, two columns input and target, you could instead write "input" and specify how to concatenate prompt and response.
dataset["text"] is a single string already in "prompt + response" form (e.g. "Q: ...\nA: ..."). SFTTrainer uses that string directly to tokenize and compute the loss.

- max_seq_length = max_seq_length
Maximum sequence length (in tokens) that the trainer will use. Any longer sequence is truncated to this value.
With max_seq_length = 2048, as defined in an earlier cell, every training example is tokenized and then truncated to 2048 tokens if needed (a typical value for Llama models, which support long contexts).

- data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer)
The collator that, for each batch, takes a list of already-tokenized examples (token IDs) and:
1) Pads the input sequences to the same length.
2) Pads the labels to the same length, filling with -100 where the loss must not be computed.
3) Returns a dictionary with keys such as input_ids, attention_mask, labels.
4) Does not tokenize anything itself: because the dataset stores "prompt" + "response" in a single text string, SFTTrainer tokenizes that string beforehand (with the model's tokenizer) and the collator only handles the padding.

- dataset_num_proc = 2
Number of parallel processes used to tokenize and preprocess the dataset. It works like dataset.map(tokenize_fn, num_proc=2): two worker processes tokenize and, if enabled, pack the data. It mainly speeds up encoding the dataset before the batches are passed to the trainer.

- packing = False
Sequence packing concatenates short sequences into a longer one to reduce padding overhead and process more data per step. If it were True, SFTTrainer would build blocks of short sequences concatenated up to max_seq_length tokens, saving time and memory. It is disabled here (False), but on datasets with very short texts (e.g. responses of a few dozen tokens) enabling packing can give up to a 5x speedup, because it minimizes the space wasted on padding.
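The padding and packing behavior described above can be illustrated with a small, library-free sketch. `pad_batch` and `pack_greedy` are hypothetical helpers written for this explanation: the real DataCollatorForSeq2Seq and the trainer's packing logic are more general (for example, padding side follows the tokenizer, and packing in TRL may split sequences across blocks), so treat this as a simplified model rather than their implementation.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def pad_batch(examples, pad_token_id):
    """Right-pad already-tokenized examples to the longest sequence in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_tokens = len(ex["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
    return batch

def pack_greedy(lengths, max_seq_length):
    """Greedily merge consecutive sequences into blocks of at most max_seq_length tokens.

    Assumes every individual length is already <= max_seq_length.
    """
    blocks, current = [], 0
    for n in lengths:
        if current and current + n > max_seq_length:
            blocks.append(current)
            current = 0
        current += n
    if current:
        blocks.append(current)
    return blocks

# Two toy examples with made-up token IDs.
examples = [
    {"input_ids": [11, 12, 13], "labels": [11, 12, 13]},
    {"input_ids": [21], "labels": [21]},
]
batch = pad_batch(examples, pad_token_id=0)
print(batch)

# Four short sequences fit into two blocks instead of four padded rows.
print(pack_greedy([40, 60, 30, 50], max_seq_length=128))
```

In the padded batch, the second row carries two padding tokens that cost compute but contribute nothing to the loss; packing removes most of that waste when sequences are short.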



In [16]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 50,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    )
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/287 [00:00<?, ? examples/s]

In [17]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

# These are the Llama 3 header tags that mark the start of the user and assistant turns.

Map (num_proc=2):   0%|          | 0/287 [00:00<?, ? examples/s]
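The idea behind train_on_responses_only is that the loss should be computed only on the assistant's tokens, not on the user prompt or the headers. The sketch below shows that masking on plain lists of token IDs. It is a simplified illustration, not Unsloth's implementation: `mask_non_responses` is a hypothetical helper and the token IDs (90, 91 for the user header, 80, 81 for the assistant header) are made up.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def find_all(seq, pattern):
    """Start indices of every occurrence of pattern inside seq."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens following a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)  # first token of the assistant reply
        # The reply ends where the next user turn starts, or at the end of the sequence.
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

USER_HEADER = [90, 91]       # stands in for the tokenized user header
ASSISTANT_HEADER = [80, 81]  # stands in for the tokenized assistant header

# One user turn (tokens 5, 6) followed by one assistant turn (tokens 7, 8, 99).
input_ids = [1, 90, 91, 5, 6, 99, 80, 81, 7, 8, 99]
labels = mask_non_responses(input_ids, USER_HEADER, ASSISTANT_HEADER)
print(labels)
```

Everything up to and including the assistant header is set to -100, so gradients come only from the reply tokens.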

In [18]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.779 GB of memory reserved.


In [19]:
trainer_stats = trainer.train()
# The training loss measures the model's error on the training data; each step updates the LoRA weights to reduce it.

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 287 | Num Epochs = 2 | Total steps = 50
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,237,063,680 (0.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,4.963
2,5.0357
3,5.1075
4,4.5018
5,4.5513
6,4.5378
7,4.1776
8,3.7889
9,3.6571
10,3.3547
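The numbers in the training header follow from the TrainingArguments, and a short calculation makes the relationship explicit. This is a sketch under the assumption that the learning rate follows the usual linear warmup-then-decay rule; `training_schedule` and `linear_lr` are hypothetical helper names, not functions from transformers or TRL.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus, max_steps):
    """Derive effective batch size, steps per epoch, and examples seen."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    examples_seen = effective_batch * max_steps
    epochs = examples_seen / num_examples
    return effective_batch, steps_per_epoch, examples_seen, epochs

def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=50):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

# Values from this notebook: 287 examples, batch 2, accumulation 4, 1 GPU, 50 steps.
effective_batch, steps_per_epoch, examples_seen, epochs = training_schedule(287, 2, 4, 1, 50)
print(effective_batch, steps_per_epoch, examples_seen)
print(math.ceil(epochs))  # the log rounds the partial second epoch up to 2

for step in (0, 5, 50):
    print(step, linear_lr(step))
```

With an effective batch of 8, the 50 steps cover 400 examples, a little under one and a half passes over the 287-row dataset, which the log reports as "Num Epochs = 2".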


# Inference with streaming

In [21]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "flux prompt for a steampunk laboratory"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Steampunk style --uplight --no dof --h 900 --w 1200 --ar 2.5:1 --no perspective, steampunk laboratory with a lot of pipes, copper edges, intricate metalwork, big clock, Brass pipelines, Black metal foil, Black paint, Carbon fiber, Glass bottle filled with green liquid, White paper, Blueprint, Artistic style refer to SHAPESHIFTER CONCEPTS of "Steampunk Lab" --no shadows, Highly detailed, Refer to concept art of SHAPESHIFTER CONCEPTS "Steampunk Laboratory Counter" --no dof, Copper


# Exporting to GGUF for Ollama

In [22]:
# Build llama.cpp in /content first, then switch to the folder where the model
# will be saved before exporting.

# 1. Change directory
%cd /content

# 2. Clone llama.cpp with its submodules
!git clone --recursive https://github.com/ggerganov/llama.cpp

# 3. Compile
%cd llama.cpp
!make clean && make all -j$(nproc)

# Now, change back to the directory where you want to save the model
# Make sure this directory exists if it's a Google Drive path
import os
save_dir = "/content/FinetuningLLM/model"
os.makedirs(save_dir, exist_ok=True)
%cd {save_dir}

# Export the model to GGUF.
# Unsloth needs a llama.cpp build for the conversion. As a best-effort hint, add the
# clone made above to sys.path. Note that sys.path only affects Python imports, so
# this may not be enough for Unsloth to locate the compiled executables.
import sys
sys.path.insert(0, "/content/llama.cpp")


model.save_pretrained_gguf("/content/FinetuningLLM/model", tokenizer, quantization_method = "f16")

# Save to q4_k_m GGUF
# model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# Save to 8bit Q8_0
# model.save_pretrained_gguf("model", tokenizer,)

/content
Cloning into 'llama.cpp'...
remote: Enumerating objects: 52390, done.
remote: Counting objects: 100% (436/436), done.
remote: Compressing objects: 100% (274/274), done.
remote: Total 52390 (delta 318), reused 169 (delta 162), pack-reused 51954 (from 4)
Receiving objects: 100% (52390/52390), 125.36 MiB | 17.12 MiB/s, done.
Resolving deltas: 100% (37852/37852), done.
Submodule 'kompute' (https://github.com/nomic-ai/kompute.git) registered for path 'ggml/src/ggml-kompute/kompute'
Cloning into '/content/llama.cpp/ggml/src/ggml-kompute/kompute'...
remote: Enumerating objects: 9122, done.        
remote: Counting objects: 100% (155/155), done.        
remote: Compressing objects: 100% (69/69), done.        
remote: Total 9122 (delta 109), reused 86 (delta 86), pack-reused 8967 (from 3)        
Receiving objects: 100% (9122/9122), 17.59 MiB | 16.14 MiB/s, done.
Resolving deltas: 100% (5728/5728), done.
Submodule path 'ggml/src/ggml-kompute/kompute': checked out '4565194ed

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.61 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 15.70it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving /content/FinetuningLLM/model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving /content/FinetuningLLM/model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at /content/FinetuningLLM/model into f16 GGUF format.
The output location will be /content/FinetuningLLM/model/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin

In [23]:
from google.colab import files
files.download('/content/FinetuningLLM/model/unsloth.F16.gguf')
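Once the GGUF file is downloaded, Ollama needs a Modelfile that points at it. The sketch below writes a minimal one from Python. It assumes only the basic `FROM` instruction; the helper `write_modelfile` and the model name `flux-prompter` are hypothetical, and depending on your Ollama version you may also need `TEMPLATE` or `PARAMETER` lines for the chat format to behave as it did in the notebook.

```python
import tempfile
from pathlib import Path

def write_modelfile(gguf_path, out_dir):
    """Write a minimal Ollama Modelfile that points at the exported GGUF file."""
    modelfile = Path(out_dir) / "Modelfile"
    modelfile.write_text(f"FROM {gguf_path}\n", encoding="utf-8")
    return modelfile

# Written to a temporary folder here; in practice use the folder that holds the GGUF file.
out_dir = tempfile.mkdtemp()
modelfile_path = write_modelfile("./unsloth.F16.gguf", out_dir)
print(modelfile_path.read_text(encoding="utf-8"))

# Then, from a terminal in that folder (model name is a placeholder):
#   ollama create flux-prompter -f Modelfile
#   ollama run flux-prompter
```

The relative path in `FROM` is resolved from the Modelfile's location, so keeping the Modelfile next to unsloth.F16.gguf is the simplest layout.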
