This notebook is intended to run on Google Colab, to take advantage of the free GPU. <br>
It loads the baseline model and produces one fine-tuned version, depending on the parameters below. <br>
A script that fine-tunes all models automatically was avoided because the free Colab tier blocks extended GPU use after a certain amount of time. <br>
The datasets were originally stored on Google Drive; they are now available on the GitHub page. <br>
Modify the following variables to suit your needs:

In [None]:
csv_base_path = '/content/drive/My Drive/Deep Learning project/' # where train data are stored
save_base_path = '/content/drive/My Drive/Deep Learning project/Saved models/' # where fine-tuned models are saved

r_lora = 32 # 32 | 64 | 128
noEmbed = False # True => default lora target modules | False => default + "embed_tokens","lm_head"
fine_tune_dataset = "_reduced" # "" for full dataset | "_partial" | "_reduced"

#fine-tuning data
csv_data = csv_base_path + 'sicilian_dataset' + fine_tune_dataset + '_train.csv'

fine_tuned_model = save_base_path + 'r' + str(r_lora) + fine_tune_dataset

if noEmbed:
  target_modules_lora = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"]

  fine_tuned_model = fine_tuned_model + '_noEmbed' # default target modules

else:
  target_modules_lora = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj","embed_tokens","lm_head"]

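The naming scheme in the cell above can be checked without touching Google Drive. The sketch below wraps the same string concatenation in a helper and lists the output folder each parameter combination would produce. `build_paths` and the short `data/` and `models/` base paths are hypothetical, introduced only for illustration.

```python
def build_paths(csv_base_path, save_base_path, r_lora, no_embed, fine_tune_dataset):
    """Return (csv_data, fine_tuned_model) using the naming scheme of the cell above."""
    csv_data = csv_base_path + "sicilian_dataset" + fine_tune_dataset + "_train.csv"
    fine_tuned_model = save_base_path + "r" + str(r_lora) + fine_tune_dataset
    if no_embed:
        # Only the default target modules are adapted, so the folder gets a suffix.
        fine_tuned_model += "_noEmbed"
    return csv_data, fine_tuned_model


# Hypothetical short base paths, used only to keep the printed output readable.
for r in (32, 64, 128):
    for suffix in ("", "_partial", "_reduced"):
        for no_embed in (False, True):
            csv_data, model_dir = build_paths("data/", "models/", r, no_embed, suffix)
            print(f"{model_dir:<28} <- {csv_data}")
```

Each run of the notebook corresponds to one line of this listing, which makes it easier to keep track of which variants have already been trained.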

In [None]:
%%capture
#packages
!pip install unsloth
!pip install --upgrade --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
#imports (unsloth first, so that its patches are applied before trl and transformers are loaded)
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
import pandas as pd
from datasets import Dataset
from google.colab import drive
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
#model + tokenizer for baseline model

max_seq_length = 512
dtype = None
load_in_4bit = True
my_model = "unsloth/llama-3-8b-Instruct-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = my_model,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

==((====))==  Unsloth 2025.1.5: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
#LoRA for optimized fine-tuning

model = FastLanguageModel.get_peft_model(
    model,
    r = r_lora,
    target_modules = target_modules_lora,
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None
)

Unsloth 2025.1.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
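As a rough check on what the rank `r` costs, the number of adapter parameters can be worked out by hand: for each adapted linear layer, LoRA adds two matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the standard Llama-3-8B dimensions (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 decoder blocks), which are not stated in this notebook, and counts only the seven projection modules; `embed_tokens` and `lm_head` are left out. `lora_param_count` is a hypothetical helper.

```python
# Assumed Llama-3-8B dimensions (not stated in this notebook).
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32        # decoder blocks

# (in_features, out_features) of each adapted linear layer in one decoder block
LAYER_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r):
    """LoRA adds A (r x in) and B (out x r) per layer: r * (in + out) parameters."""
    per_block = sum(r * (n_in + n_out) for n_in, n_out in LAYER_SHAPES.values())
    return per_block * NUM_LAYERS


for r in (32, 64, 128):
    print(f"r = {r:>3}: {lora_param_count(r):,} trainable adapter parameters")

# Scaling applied to the adapter output when use_rslora = False: lora_alpha / r
print("scaling at r = 32, lora_alpha = 16:", 16 / 32)
```

Under these assumptions the count for `r = 32` agrees with the number of trainable parameters reported in the training log further down. Because `lora_alpha` stays at 16, doubling `r` also halves the scaling factor applied to the adapter output.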


In [None]:
#load the dataset: a single column named "text" containing Sicilian sentences
#train one model at a time, otherwise Google Colab shuts down the session
drive.mount('/content/drive')

df = pd.read_csv(csv_data)

df = df.dropna() # drop rows with missing values: they caused problems in SFTTrainer

dataset = Dataset.from_pandas(df)

print(dataset.column_names)
print(dataset[0])

Mounted at /content/drive
['text']
{'text': 'Sta vuci è sulu un abbozzu (stub). siddu vuliti, putiti ammigghiuràrila secunnu li cumminzioni dâ wikipedia.'}


In [None]:
#trainer parameters

torch.cuda.empty_cache()

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 6,
        warmup_steps = 5,
        max_steps = 65,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)


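The training arguments above imply an effective batch size and a learning-rate curve that are worth making explicit. The sketch below recomputes them in plain Python, using the dataset size reported in the training log (77,883 examples). `linear_lr` is a hypothetical helper that follows the usual linear warmup and linear decay shape; it is an approximation of the schedule, not the scheduler object the trainer builds.

```python
# Values copied from the TrainingArguments above and the dataset size in the log.
per_device_train_batch_size = 2
gradient_accumulation_steps = 6
num_gpus = 1
max_steps = 65
warmup_steps = 5
learning_rate = 2e-4
num_examples = 77_883

# One optimizer step consumes batch_size * accumulation_steps examples per GPU.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
examples_seen = effective_batch * max_steps
print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({examples_seen / num_examples:.2%} of the dataset)")


def linear_lr(step):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate * max(0, max_steps - step) / (max_steps - warmup_steps)


for step in (0, 1, 5, 35, 65):
    print(f"step {step:>2}: lr = {linear_lr(step):.2e}")
```

With `max_steps = 65` the run stops long before a full pass over the data, so each fine-tuned variant sees only a small sample of the training set.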
In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.645 GB of memory reserved.


wandb API key:



In [None]:
torch.cuda.empty_cache()
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 77,883 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 6
\        /    Total batch size = 12 | Total steps = 65
 "-____-"     Number of trainable parameters = 83,886,080


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Step,Training Loss
1,4.0003
2,4.2494
3,5.0995
4,3.8706
5,4.0943
6,3.5913
7,3.8004
8,4.1096
9,3.3146
10,3.4274


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

459.2505 seconds used for training.
7.65 minutes used for training.
Peak reserved memory = 6.836 GB.
Peak reserved memory for training = 1.191 GB.
Peak reserved memory % of max memory = 46.352 %.
Peak reserved memory for training % of max memory = 8.076 %.
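The derived figures in this printout can be reproduced from the raw values without a GPU, which is handy when comparing runs afterwards. The sketch below copies the numbers from the outputs above; `seconds_per_step` is an extra quantity, not printed by the notebook, obtained by dividing the runtime by the 65 training steps. Note that the notebook divides bytes by 1024 three times, so the "GB" values are strictly gibibytes.

```python
# Figures copied from the two memory printouts and the training log above.
start_gpu_memory = 5.645   # GB reserved before training
used_memory = 6.836        # GB peak reserved after training
max_memory = 14.748        # GB available on the Tesla T4
train_runtime = 459.2505   # seconds
max_steps = 65

# Same arithmetic as the cell above, applied to the recorded values.
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
seconds_per_step = round(train_runtime / max_steps, 2)

print(f"memory added by training: {used_memory_for_lora} GB ({lora_percentage} % of the GPU)")
print(f"peak memory: {used_memory} GB ({used_percentage} % of the GPU)")
print(f"average time per optimizer step: {seconds_per_step} s")
```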


In [None]:
# inference and chat template

FastLanguageModel.for_inference(model)

chat_template = """
USER: {INPUT}
ASSISTANT: {OUTPUT}"""
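In the cells shown here, `chat_template` is defined but not passed to the tokenizer; the next cell relies on `tokenizer.apply_chat_template` with the tokenizer's own template. If the plain-text template were to be used directly, one possible way to fill it is ordinary string formatting, as in the sketch below. `render_turn` is a hypothetical helper, not part of the notebook or of any library.

```python
chat_template = """
USER: {INPUT}
ASSISTANT: {OUTPUT}"""


def render_turn(user_text, assistant_text=""):
    """Fill the template; leave the assistant slot empty to build a prompt."""
    return chat_template.format(INPUT=user_text, OUTPUT=assistant_text)


prompt = render_turn("Quali sunnu i culuri primari?")
print(prompt)
```

A prompt built this way ends right after `ASSISTANT: `, so the model's continuation would be read as the answer.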

In [None]:
#try it out

messages = [
    {"role": "user", "content": "Quali sunnu i culuri primari?"},
]

#other prompts to try:
#Quali sunnu li 5 principali citati siciliani?
#Cu je Dante Alighieri? scrivi 'na biografia curta.
#Quali sunnu li ionni dilla semana?
#Quali sunnu i culuri primari?
#Amicu me parrami dû vangelu di Santu Matteu
#Quannu è natu Gesù Cristu?
#Quannu si fistiggia 'u Natali Cristianu?
#mi putissi cuntari a storia dû pinocchiu?

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(
    input_ids,
    streamer = text_streamer,
    max_new_tokens = 512,
    pad_token_id = tokenizer.eos_token_id,
    temperature = 0.75,
    repetition_penalty = 1.1,
    top_k = 60,
    top_p = 0.95,
    length_penalty = 1.1,
)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


I culuri primari (o colori fondamentali) sunnu li cullori chi nun si ponnu reduci a nu mustru di ntra lu jancu e lu neri, comu:

* rulu (o rosso)
* blua
* giallu (o amariuru)

st'ulti trè culuri pri l'occhi umani sunnu sufficenti pi generari tutti li altri culura cu diverse combinazioni. 'n particulari, quarchiduni culuri pri l'elimentu pittorìa sunnu:

* rulu + blua = virdi
* rulu + giallu = aranciu
* blua + giallu = ciancu
* rulu + nera = vini
* blua + nera = viuli
* giallu + nera = neri o bavunisiu

ecc... e siccomu dicivi, sta lista nun è cumpluta! ma, na maniera assai approssimata, li culuri primari sunnu nu bon puntu di partenza puru.<|eot_id|>
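The `temperature`, `top_k` and `top_p` values passed to `generate` shape the distribution from which each next token is drawn when sampling is enabled. The sketch below is a simplified, pure-Python illustration of those three steps on a toy list of logits. It is not the transformers implementation, it does not model `repetition_penalty` or `length_penalty`, and `filter_distribution`, `sample` and `toy_logits` are hypothetical names.

```python
import math
import random


def filter_distribution(logits, temperature=0.75, top_k=60, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(ranked, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = filter_distribution(toy_logits, temperature=0.75, top_k=3, top_p=0.95)
print(dist)
print(sample(toy_logits, top_k=3))
```

Lowering `top_p` or `top_k` shrinks the set of candidate tokens, which trades variety for more conservative output; this may matter for a low-resource language where unlikely tokens are often misspellings.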


In [None]:
# Save model and tokenizer

model.save_pretrained(fine_tuned_model)
tokenizer.save_pretrained(fine_tuned_model)