If you'd rather not install all the packages yourself, use Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) to start training with no setup or environment issues. [Read the guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).

### Installation

In [1]:
!pip install -U tensorflow -q
!pip install -U unsloth vllm -q
!pip install bitsandbytes accelerate peft -q

#### Continued Pre-Training

This dataset contains the full text of the **Italian Constitution**, article by article.
It is formatted for **raw text training** — each line in the JSONL file corresponds to a single article under the `"text"` field.

For example:

```text
Art. 1.
L’Italia è una Repubblica democratica, fondata sul lavoro.
La sovranità appartiene al popolo, che la esercita nelle forme e nei limiti della Costituzione.
```

This structure allows for **text completion** or **language modeling** tasks, where the model learns from continuous natural text rather than instruction–response pairs (as in `Alpaca`-style datasets).
You can fine-tune a model on this dataset to generate or complete legal-style text, or to explore language modeling over formal Italian institutional language.
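As a minimal sketch of the file layout described above, the snippet below writes and reads a JSONL file with one `"text"` record per article. The helper names `write_jsonl` and `read_jsonl`, the sample file name, and the shortened second article are illustrative assumptions, not part of the original dataset tooling.

```python
import json
import tempfile
from pathlib import Path

# Illustrative sample only; the real file holds every article of the Constitution.
articles = [
    "Art. 1.\nL’Italia è una Repubblica democratica, fondata sul lavoro.",
    "Art. 2.\nLa Repubblica riconosce e garantisce i diritti inviolabili dell’uomo.",  # shortened
]

def write_jsonl(texts, path):
    """Write one JSON object per line, each with a single "text" field."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps accented characters readable in the file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back into a list of {"text": ...} dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical sample path in the system temp directory
sample_path = Path(tempfile.gettempdir()) / "costituzione_sample.jsonl"
write_jsonl(articles, sample_path)
records = read_jsonl(sample_path)
print(len(records), "records; first starts with:", records[0]["text"][:7])
```

Newlines inside an article are escaped as `\n` by `json.dumps`, so each article still occupies exactly one line of the file.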


In [2]:
# Run this to disable CCE since it is not supported for CPT.
# Keep the comment on its own line: %env would otherwise store it as part of the value.
%env UNSLOTH_RETURN_LOGITS=1

env: UNSLOTH_RETURN_LOGITS=1


In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-10-27 19:18:22.274245: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


INFO 10-27 19:18:28 [__init__.py:216] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.10: Fast Mistral patching. Transformers: 4.56.1. vLLM: 0.11.0.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 21.951 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

We also add `embed_tokens` and `lm_head` so the model can learn out-of-distribution data.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 !
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",

                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.10.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
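
The trainable parameter count that the training log reports later (603,979,776) can be reproduced by hand. The sketch below is a rough check, not Unsloth's own accounting: the layer dimensions are the commonly published Mistral-7B-v0.3 sizes and are assumptions here, as is the premise that `embed_tokens` and `lm_head` are trained in full rather than through low-rank adapters.

```python
# Assumed Mistral-7B-v0.3 dimensions (not stated in this notebook)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
VOCAB = 32768
NUM_LAYERS = 32
R = 128                # LoRA rank used in get_peft_model

# (in_features, out_features) of each targeted linear layer
LORA_TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)

adapter_params = NUM_LAYERS * sum(
    lora_params(i, o, R) for i, o in LORA_TARGETS.values()
)
# Assumption: both embedding matrices are trained as full copies
embedding_params = 2 * VOCAB * HIDDEN
trainable = adapter_params + embedding_params

print(f"{trainable:,}")  # 603,979,776
```

Under these assumptions the two embedding matrices account for about 44% of the trainable parameters, which is why adding them pushes the trained share to 7.69% of the model.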


In [5]:
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files="costituzione.jsonl",
    split='train'
)

EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    return { "text" : [example + EOS_TOKEN for example in examples["text"]] }
dataset = dataset.map(formatting_prompts_func, batched = True,)

In [6]:
dataset[0]

{'text': 'Art. 1.\nL’Italia e` una Repubblica democratica, fondata\nsul lavoro.\nLa sovranita` appartiene al popolo, che la esercita nelle forme e nei limiti della Costituzione.</s>'}
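
With `batched = True`, `map` passes the function a dictionary of columns, each holding a list of values, rather than a single row. The standalone sketch below mimics that call with a plain dictionary; the two short strings are placeholders, and `EOS_TOKEN` is hard-coded to `</s>`, the value visible in the output above.

```python
EOS_TOKEN = "</s>"  # hard-coded here; the notebook reads it from tokenizer.eos_token

def formatting_prompts_func(examples):
    # `examples` maps each column name to a list of values for the batch
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

# Placeholder batch shaped like what a batched map call would pass in
batch = {"text": ["Art. 1.", "Art. 2."]}
formatted = formatting_prompts_func(batch)
print(formatted["text"])  # ['Art. 1.</s>', 'Art. 2.</s>']
```

Appending the EOS token to every article is what lets the model learn where an article ends instead of generating indefinitely.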

Print the first 5 articles from the dataset:

In [7]:
for row in dataset[:5]["text"]:
    print("=========================")
    print(row)

Art. 1.
L’Italia e` una Repubblica democratica, fondata
sul lavoro.
La sovranita` appartiene al popolo, che la esercita nelle forme e nei limiti della Costituzione.</s>
Art. 2.
La Repubblica riconosce e garantisce i diritti
inviolabili dell’uomo, sia come singolo, sia nelle
formazioni sociali ove si svolge la sua personalita`,
e richiede l’adempimento dei doveri inderogabili
di solidarieta` politica, economica e sociale.</s>
Art. 4. La Repubblica riconosce a tutti i cittadini il diritto al lavoro e promuove le condizioni che rendano effettivo questo diritto. Ogni cittadino ha il dovere di svolgere, secondo le proprie possibilita` e la propria scelta, una attivita` o una funzione che concorra al progresso materiale o spirituale della societa`.</s>
Art. 5. La Repubblica, una e indivisibile, riconosce e promuove le autonomie locali; attua nei servizi che dipendono dallo Stato il piu` ampio decentramento amministrativo [118]; adegua i princı`pi ed i metodi della sua legislazione alle esige

### Test the model before fine-tuning

In [8]:
from transformers import TextIteratorStreamer
from threading import Thread
text_streamer = TextIteratorStreamer(tokenizer)
import textwrap
max_print_width = 100

# Before running inference, call `FastLanguageModel.for_inference` first

FastLanguageModel.for_inference(model)

inputs = tokenizer(
[
    """Art. 1.
L’Italia e`"""
]*1, return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 256,
    use_cache = True,
)
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")
    pass
pass

<s> Art. 1. L’Italiae` una Repubblica democratica, fondata sul lavoro, sulla libert`a e sulla 
solidariet`a.

Art. 2.
La Repubblica e` costituita da Stati, che hanno autonomia ordinaria e straordinaria, in 
base a norme costituzionali.

Art. 3.
La Repubblica e` una Repubblica sociale, che si propone di 
garantire la dignita` umana e di assicurare la partecipazione dei cittadini al potere politico, economico e 
sociale.

Art. 4.
La Repubblica e` una Repubblica unitaria e indivisibile.

Art. 5.
La Repubblica e` una 
Repubblica di diritto.

Art. 6.
La Repubblica e` una Repubblica di diritto pubblico, che si propone di 
garantire la tutela dei diritti fondamentali e di tutelare la libert`a e l’indipendenza della giustizia.


Art. 7.
La Repubblic

<a name="Train"></a>
### Continued Pretraining
Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).

We are going to do 5 epochs.

Also set `embedding_learning_rate` 2x to 10x smaller than `learning_rate` to make continual pretraining work! Here it is 10x smaller (5e-6 vs. 5e-5).
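
For a dataset this small, it helps to work out how many optimizer steps 5 epochs actually produce. The sketch below uses the values from this notebook (11 examples, batch size 2, 8 gradient accumulation steps, 1 GPU). The ceiling-based step formula is a simplifying assumption; the trainer's exact rounding may differ in other configurations.

```python
import math

# Values taken from this notebook's dataset and trainer configuration
num_examples = 11
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1
num_train_epochs = 5
learning_rate = 5e-5
embedding_learning_rate = 5e-6

# Examples consumed per optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
# Simplified estimate: at least one optimizer step per epoch
steps_per_epoch = max(1, math.ceil(num_examples / effective_batch_size))
total_steps = steps_per_epoch * num_train_epochs
lr_ratio = learning_rate / embedding_learning_rate

print(f"Effective batch size: {effective_batch_size}")
print(f"Total optimizer steps: {total_steps}")
print(f"Embedding LR is {lr_ratio:.0f}x smaller than the LoRA LR")
```

Because all 11 examples fit inside one effective batch of 16, each epoch is a single optimizer step, so the whole run performs only 5 weight updates.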

In [9]:
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        warmup_ratio = 0.1,
        num_train_epochs = 5,

        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6,

        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

num_proc must be <= 11. Reducing num_proc to 11 for dataset of size 11.


In [10]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 21.951 GB.
6.766 GB of memory reserved.


In [11]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 11 | Num Epochs = 5 | Total steps = 5
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 603,979,776 of 7,852,003,328 (7.69% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 2.0159        |
| 2    | 2.0167        |
| 3    | 0.7809        |
| 4    | 0.5019        |
| 5    | 0.3013        |


In [12]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

26.4263 seconds used for training.
0.44 minutes used for training.
Peak reserved memory = 10.242 GB.
Peak reserved memory for training = 3.476 GB.
Peak reserved memory % of max memory = 46.658 %.
Peak reserved memory for training % of max memory = 15.835 %.


<a name="Inference"></a>
### Inference
Let's run the model!

We first check whether the model has picked up the wording and style of the Italian Constitution.

We reuse the prompt from the pre-training test, the opening words of Article 1, so the output can be compared directly with the base model's completion above.

In [13]:
from transformers import TextIteratorStreamer
from threading import Thread
text_streamer = TextIteratorStreamer(tokenizer)
import textwrap
max_print_width = 100

# Before running inference, call `FastLanguageModel.for_inference` first

FastLanguageModel.for_inference(model)

inputs = tokenizer(
[
    """Art. 1.
L’Italia e`"""
]*1, return_tensors = "pt").to("cuda")

generation_kwargs = dict(
    inputs,
    streamer = text_streamer,
    max_new_tokens = 256,
    use_cache = True,
)
thread = Thread(target = model.generate, kwargs = generation_kwargs)
thread.start()

length = 0
for j, new_text in enumerate(text_streamer):
    if j == 0:
        wrapped_text = textwrap.wrap(new_text, width = max_print_width)
        length = len(wrapped_text[-1])
        wrapped_text = "\n".join(wrapped_text)
        print(wrapped_text, end = "")
    else:
        length += len(new_text)
        if length >= max_print_width:
            length = 0
            print()
        print(new_text, end = "")
    pass
pass

<s> Art. 1. L’Italiae` una Repubblica democratica, fondata
sul lavoro.
La sovranita` appartiene al 
popolo, che la esercita nelle forme e nei limiti della Costituzione.</s>