# 1. Inspecting and Preparing Datasets for LLM Fine-Tuning

Typical dataset formats:
- **Generic text**: 
```json
{"text": "..."}
```

- **Instruction-following**, e.g., Alpaca-style:

```json
{"instruction": "...", "input": "...", "output": "..."}
```

or OpenAI-style:

```json
{"prompt": "...", "response": "..."}
```

- **Multi-turn conversational**:

```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, ...]}
```
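
As a rough illustration of how the three layouts differ in practice, the sketch below classifies a record by the keys it carries. The function name `detect_format` and the key heuristics are assumptions made for this example, not part of any library; real datasets often use other column names.

```python
def detect_format(record: dict) -> str:
    """Guess the dataset format of a single record from its keys (heuristic)."""
    keys = set(record)
    # Multi-turn data usually carries a list of turns under one key
    if "messages" in keys or "conversations" in keys:
        return "conversational"
    # Instruction data pairs a request with a target answer
    if "instruction" in keys or {"prompt", "response"} <= keys:
        return "instruction"
    # Plain corpora expose a single free-text field
    if "text" in keys:
        return "text"
    return "unknown"


samples = [
    {"text": "..."},
    {"instruction": "...", "input": "...", "output": "..."},
    {"prompt": "...", "response": "..."},
    {"messages": [{"role": "user", "content": "..."}]},
]
for sample in samples:
    print(detect_format(sample))
```

A check like this can be run on the first record of a dataset before deciding which preprocessing path to apply.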

In [1]:
from datasets import load_dataset
import json

## 1. Generic text

In [2]:
ds_text = load_dataset("roneneldan/TinyStories")

In [3]:
ds_text

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

In [4]:
print(json.dumps(ds_text['train'][0], indent=2))

{
  "text": "One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, \"Mom, I found this needle. Can you share it with me and sew my shirt?\" Her mom smiled and said, \"Yes, Lily, we can share the needle and fix your shirt.\"\n\nTogether, they shared the needle and sewed the button on Lily's shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together."
}


## 2. Instruction Following

In [5]:
ds_instruction_following = load_dataset("databricks/databricks-dolly-15k")

In [6]:
ds_instruction_following

DatasetDict({
    train: Dataset({
        features: ['instruction', 'context', 'response', 'category'],
        num_rows: 15011
    })
})

In [7]:
print(json.dumps(ds_instruction_following['train'][0], indent=2))

{
  "instruction": "When did Virgin Australia start operating?",
  "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.",
  "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
  "category": "closed_qa"
}
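
The Dolly record uses `instruction`/`context`/`response` rather than the Alpaca-style `instruction`/`input`/`output` keys listed at the top. A minimal sketch of one possible mapping follows; the helper name `dolly_to_alpaca` is hypothetical, and dropping `category` is a choice made here for brevity.

```python
def dolly_to_alpaca(record: dict) -> dict:
    """Map a Dolly-style record onto Alpaca-style keys."""
    return {
        "instruction": record["instruction"],
        # Dolly's optional supporting passage plays the role of Alpaca's "input"
        "input": record.get("context", ""),
        "output": record["response"],
    }


example = {
    "instruction": "When did Virgin Australia start operating?",
    "context": "It commenced services on 31 August 2000 as Virgin Blue.",
    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue.",
    "category": "closed_qa",
}
print(dolly_to_alpaca(example))
```

With the `datasets` library, a function like this could be applied to every row via `Dataset.map`, passing `remove_columns=["context", "response", "category"]` to drop the old columns.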


## 3. Multi-turn conversation

In [8]:
ds_multi_turn = load_dataset("Open-Orca/SlimOrca", split="train[:1000]")

In [9]:
ds_multi_turn

Dataset({
    features: ['conversations'],
    num_rows: 1000
})

In [10]:
print(json.dumps(ds_multi_turn[2], indent=2))

{
  "conversations": [
    {
      "from": "system",
      "value": "You are an AI assistant. You will be given a task. You must generate a detailed and long answer.",
      "weight": null
    },
    {
      "from": "human",
      "value": "Produce a long descriptive sentence that uses all these words: Albuquerque, New Mexico, areaOfLand, 486.2 (square kilometres); Albuquerque, New Mexico, populationDensity, 1142.3 (inhabitants per square kilometre); Albuquerque, New Mexico, isPartOf, Bernalillo County, New Mexico; Albuquerque, New Mexico, areaTotal, 490.9 (square kilometres)",
      "weight": 0.0
    },
    {
      "from": "gpt",
      "value": "Stretching across a vast areaOfLand, totaling 486.2 square kilometres, Albuquerque, New Mexico, boasts a thriving populationDensity of approximately 1142.3 inhabitants per square kilometre, all residing within the expansive city limits which are part of the beautiful Bernalillo County in New Mexico, enveloping an impressive areaTotal of 490.9 

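The SlimOrca record stores turns under `conversations` with `from`/`value` keys, whereas the multi-turn layout listed at the top uses `messages` with `role`/`content`. The sketch below shows one way to convert between the two. The helper name `to_messages` and the role mapping are assumptions for this example, and the `weight` field is simply dropped.

```python
# Assumed mapping from the speaker labels seen in the record to chat roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(record: dict) -> dict:
    """Convert a from/value conversation into a role/content message list."""
    messages = []
    for turn in record["conversations"]:
        messages.append({
            "role": ROLE_MAP[turn["from"]],  # raises KeyError on an unknown speaker
            "content": turn["value"],
        })
    return {"messages": messages}


# Shortened, illustrative record (not taken from the dataset)
record = {
    "conversations": [
        {"from": "system", "value": "You are an AI assistant.", "weight": None},
        {"from": "human", "value": "Say hello.", "weight": 0.0},
        {"from": "gpt", "value": "Hello!", "weight": 1.0},
    ]
}
print(to_messages(record))
```

Failing loudly on an unknown speaker label is deliberate here: silently passing unexpected roles through tends to surface later as chat-template errors.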
# 2. Create your dataset

In [1]:
import json
import random
from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

INSTRUCT_MODEL = "google/gemma-3-4b-it"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(INSTRUCT_MODEL)

# Detect the best available device (informational only:
# the actual placement below is handled by device_map="auto")
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    INSTRUCT_MODEL,
    torch_dtype="auto",
    device_map="auto"
)

def generate_stepwise_response(prompt, max_new_tokens=500):
    """Generate a response with Gemma, excluding the input prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7
    )

    # Full decode (prompt + completion), printed for inspection
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    print("\t", full_response)

    # Keep only the newly generated tokens, so the prompt is excluded even
    # when decoding does not reproduce it character for character
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()

    return response


In [2]:
concepts = [
    "Come fare una torta di mele",
    "Come costruire una casa sull'albero",
    "Come coltivare un girasole"
]

dataset = []
num_samples = 3

for instruction in concepts[:num_samples]:

    # Prompt
    prompt = (
        f"{instruction}\n\n"
        "Rispondi in modo chiaro e ordinato, spiegando passo dopo passo."
    )

    print(f"Generazione per: {prompt}")
    response = generate_stepwise_response(prompt)

    entry = {
        "instruction": "Rispondi alla seguente istruzione passo dopo passo:\n",
        "input": instruction,
        "output": response
    }
    dataset.append(entry)


# ================================================================
# SAVE AS JSONL
# ================================================================
output_path = Path("gemma_step_reasoning_dataset.jsonl")
with output_path.open("w", encoding="utf-8") as f:
    for entry in dataset:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"\nDataset saved to: {output_path}")
print(f"First record example:\n{json.dumps(dataset[0], ensure_ascii=False, indent=2)}")

Generazione per: Come fare una torta di mele

Rispondi in modo chiaro e ordinato, spiegando passo dopo passo.
	 Come fare una torta di mele

Rispondi in modo chiaro e ordinato, spiegando passo dopo passo.

Ecco come preparare una deliziosa torta di mele:

**Ingredienti:**

*   **Per l'impasto:**
    *   250g di farina 00
    *   125g di burro freddo, a cubetti
    *   100g di zucchero
    *   1 uovo grande
    *   Un pizzico di sale
    *   Scorza grattugiata di 1/2 limone (facoltativo)
    *   50-75 ml di acqua fredda (se necessario)
*   **Per il ripieno:**
    *   6-8 mele (Golden Delicious, Renetta o Granny Smith sono ottime)
    *   50g di zucchero (regola in base alla dolcezza delle mele)
    *   1 cucchiaino di cannella in polvere
    *   Succo di 1/2 limone
    *   20g di burro a fiocchetti
*   **Per la finitura (facoltativo):**
    *   Zucchero a velo per spolverare

**Preparazione:**

1.  **Prepara l'impasto:**
    *   In una ciotola grande, mescola la farina, lo zucchero e il

In [8]:
dataset[0]

{'instruction': 'Rispondi alla seguente istruzione passo dopo passo:\n',
 'input': 'Spiega passo dopo passo come coltivare un girasole.',
 'output': "Ecco una guida su come preparare una deliziosa torta di mele, con un'attenzione particolare alla chiarezza e all'ordine dei passaggi:\n\n**Ingredienti:**\n\n*   **Per la pasta frolla:**\n    *   250g di farina 00\n    *   125g di burro freddo, tagliato a cubetti\n    *   100g di zucchero\n    *   1 uovo grande\n    *   Un pizzico di sale\n    *   2-3 cucchiai di acqua fredda (se necessario)\n*   **Per il ripieno:**\n    *   6-8 mele (Golden Delicious, Renetta, Granny Smith o un mix)\n    *   50g di zucchero (a seconda della dolcezza delle mele)\n    *   1 cucchiaino di cannella in polvere\n    *   Succo di mezzo limone\n    *   20g di burro a fiocchetti\n    *   2 cucchiai di pangrattato (per assorbire l'umidità)\n\n**Preparazione:**\n\n**1. Preparazione della pasta frolla:**\n\n*   **Mescola gli ingredienti secchi:** In una ciotola capie
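
Since generated text can come back empty or cut off, it may be worth reading the JSONL file back and checking each record before using it for fine-tuning. The following is a minimal sketch under stated assumptions: the helpers `load_jsonl` and `find_problems` are hypothetical names, the required keys mirror the `instruction`/`input`/`output` entries built above, and a temporary file stands in for the real dataset file.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("instruction", "input", "output")


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def find_problems(records):
    """Return (record_index, key) pairs whose value is missing, non-string, or blank."""
    problems = []
    for idx, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((idx, key))
    return problems


# Demo on a throwaway file with one good and one bad record
demo_records = [
    {"instruction": "Answer step by step:\n", "input": "How to grow a sunflower", "output": "1. Sow the seeds."},
    {"instruction": "Answer step by step:\n", "input": "How to bake an apple pie", "output": ""},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    with demo_path.open("w", encoding="utf-8") as f:
        for rec in demo_records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    loaded = load_jsonl(demo_path)

problems = find_problems(loaded)
print(problems)
```

This only catches structural issues; truncated or off-topic generations still need a manual or model-based review.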