### Libraries Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Model Initialization

In [None]:
from unsloth import FastLanguageModel
import torch

# default values chosen by Unsloth for us!
max_seq_length = 2048 # Maximum number of tokens the model processes at once. Choose any value; Unsloth handles RoPE scaling internally.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Set to True to use 4-bit quantization and reduce memory usage.

llm, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose  "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.8: Fast Llama patching. Transformers: 4.52.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/6.43G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/3.83k [00:00<?, ?B/s]

We now add LoRA adapters so that we only need to update a small fraction of the parameters: typically 1 to 10%, and about 0.75% in this run according to the training log.

"*LoRA (Low-Rank Adaptation of Large Language Models) is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share.*" - HuggingFace

We can tune the following values to increase accuracy while guarding against over-fitting.

### Parameter definitions
- **r**: The rank of the low-rank matrices that LoRA trains. A larger number uses more memory and is slower, but can increase accuracy on harder tasks. We normally suggest numbers like 8 (for fast finetunes), and up to 128. Values that are too large can cause over-fitting, damaging your model's quality.
- **target_modules**: Selects which parts of the model are modified by LoRA. We select the most important and sensitive modules in transformer models because, by updating only these, we can adapt the model to new tasks without changing everything (making it much lighter!).
- **lora_alpha**: The scaling factor for finetuning. A larger number makes the finetune learn more from your dataset, but can promote over-fitting. We suggest setting it equal to the rank r, or to double the rank.
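
As a rough illustration of how `r` and `lora_alpha` interact: in the standard LoRA formulation the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weights, while rank-stabilized LoRA (`use_rslora = True`) uses `lora_alpha / sqrt(r)` instead. The helper below is hypothetical and only reproduces that arithmetic; it is not an Unsloth or PEFT function.

```python
import math

def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update (hypothetical helper)."""
    if r <= 0:
        raise ValueError("r must be a positive integer")
    # Standard LoRA divides by r; rank-stabilized LoRA divides by sqrt(r).
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

for r, alpha in [(8, 8), (16, 16), (16, 32), (64, 16)]:
    print(f"r={r:>3} alpha={alpha:>3} -> scale {lora_scale(alpha, r):.2f}")
```

With `lora_alpha` equal to `r` the multiplier is 1, and doubling `lora_alpha` doubles it, which is why the two values are usually tuned together.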

In [None]:
# default parameters for LoRA (peft=Parameter Efficient Fine-Tuning)
model = FastLanguageModel.get_peft_model(
    llm,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", # attention mechanisms modules
                      "gate_proj", "up_proj", "down_proj",], # feed-forward modules
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context and reduce memory usage by an extra 30%
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.5.8 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
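
To make the parameter saving concrete, here is a rough count for the 28 patched layers. A LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights instead of training all `d_in * d_out`. The layer shapes below are assumptions based on the published Llama-3.2-3B configuration (hidden size 3072, MLP size 8192, key/value projection size 1024); they are not stated anywhere in this notebook.

```python
# Assumed Llama-3.2-3B shapes (not from this notebook): (d_in, d_out) per module.
HIDDEN, KV_DIM, MLP, NUM_LAYERS = 3072, 1024, 8192, 28
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """Weights added by LoRA: one (rank x d_in) and one (d_out x rank) matrix per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS

def full_param_count() -> int:
    """Weights in the targeted linear layers themselves, for comparison."""
    per_layer = sum(d_in * d_out for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS

lora_params = lora_param_count(rank=16)
print(f"LoRA (r=16): {lora_params:,} trainable weights")
print(f"Targeted layers in full: {full_param_count():,} weights")
```

Under these assumptions the count for `r = 16` is 24,313,856, the same trainable-parameter figure that Unsloth prints when training starts.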


In [None]:
# Same default LoRA parameters again, stored as model_ft for the fine-tuning run
model_ft = FastLanguageModel.get_peft_model(
    llm,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", # attention mechanisms modules
                      "gate_proj", "up_proj", "down_proj",], # feed-forward modules
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context and reduce memory usage by an extra 30%
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)



### Test the base Llama model with a generic question

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.2",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "What is fibonacci serie?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024,
                   use_cache = True, temperature = 0.7, min_p = 0.1)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The Fibonacci series is a series of numbers in which each number is the sum of the two preceding numbers, starting from 0 and 1. This series is named after the Italian mathematician Leonardo Fibonacci, who introduced it in the 13th century.

The Fibonacci series begins like this:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, and so on.

The key property of the Fibonacci series is that each number is the sum of the two preceding numbers. For example:

- 0 + 1 = 1
- 1 + 1 = 2
- 1 + 2 = 3
- 2 + 3 = 5
- 3 + 5 = 8
- 5 + 8 = 13
- 8 + 13 = 21
- 13 + 21 = 34
- And so on.

The Fibonacci series appears in many areas of mathematics, science, and nature, such as:

1. **Biology**: The arrangement of leaves on a stem, the branching of trees, and the flowering of artichokes follow a Fibonacci pattern.
2. **Finance**: The Fibonacci retracement levels are used in technical analysis to predict price movements in financial markets.
3. **Geometry**: The Fibonacci spiral is a curve that gets wider by a fa

### Test the base Llama model with a company-specific question

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.2",
)
FastLanguageModel.for_inference(model_ft) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "How can I request a new company laptop?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model_ft.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024,
                   use_cache = True, temperature = 0.7, min_p = 0.1)

Here's a step-by-step guide on how to request a new company laptop:

**Before You Request:**

1. **Check company policies**: Review your company's IT department or HR policies to see if there are any specific guidelines or requirements for requesting a new laptop.
2. **Assess your needs**: Consider your work requirements and whether a new laptop is necessary. If you're due for an upgrade, use this opportunity to request a new one.

**Requesting a New Laptop:**

1. **Schedule a meeting with your supervisor or IT representative**: Request a meeting to discuss your laptop needs and request a new one.
2. **Prepare your request**: Before the meeting, make a list of your laptop requirements, including:
	* Your current laptop's specifications and condition
	* Your work requirements and how a new laptop will improve your productivity
	* Any specific features or requirements you need (e.g., touchscreen, stylus support, etc.)
3. **During the meeting**:
	* Present your request and explain why you

### Load dataset

In [None]:
import json
import pandas as pd

# Specify the path to your JSONL file
jsonl_file_path = 'company_internal_processes_en.jsonl'

data = []
with open(jsonl_file_path, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.DataFrame(data)

In [None]:
df.head(10)

Unnamed: 0,question,answer
0,How do I request a new company laptop?,"To request a new company laptop, fill out the ..."
1,Who is the contact person for IT security trai...,The contact person for IT security training is...
2,Where can I find the forms for travel expense ...,The forms can be found in the 'Company Documen...
3,What is the procedure for reporting a GDPR pol...,Report the violation via email to privacy@fint...
4,Who approves leave requests longer than 10 days?,Leave requests longer than 10 days must be app...
5,How do I obtain access to the company VPN from...,Send a request to supporto.it@fintaazienda.com...
6,To whom should monthly expense reports be sent?,Monthly expense reports should be sent to ammi...
7,Where can I find the server maintenance schedule?,The server maintenance schedule is published o...
8,Who is the contact person for company benefits?,"For company benefits, you can contact Chiara B..."
9,How do I request access to new management soft...,Request access by filling out the 'Software Ac...
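
Since the later cells assume that every record has a `question` and an `answer` field, a small validating loader can catch malformed lines early. This is only a sketch: the helper name `load_qa_jsonl` is hypothetical, skipping blank lines is an assumption about how the file may be formatted, and the two sample questions come from the table above with their answers left as placeholders.

```python
import io
import json

REQUIRED_KEYS = ("question", "answer")

def load_qa_jsonl(lines):
    """Parse JSONL lines into dicts, checking that each has the required keys."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # assumption: blank lines are tolerated and skipped
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        records.append(record)
    return records

# In-memory stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO(
    '{"question": "Who approves leave requests longer than 10 days?", "answer": "..."}\n'
    "\n"
    '{"question": "To whom should monthly expense reports be sent?", "answer": "..."}\n'
)
records = load_qa_jsonl(sample)
print(len(records), "records loaded")
```

With the real dataset, the same function can be given the open file handle in place of the `io.StringIO` object.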


### Format it for fine-tuning

In [None]:
from datasets import Dataset
df["conversations"] = df.apply(
    lambda x: [
        {"content": x["question"], "role": "user"},
        {"content": x["answer"], "role": "assistant"}
    ], axis=1
)

# drop old columns since we now have a single column containing both question and answer formatted as needed
dataset = Dataset.from_pandas(df.drop(columns=["question", "answer"]))

In [None]:
dataset["conversations"][:3] # list of lists of dictionaries

[[{'content': 'How do I request a new company laptop?', 'role': 'user'},
  {'content': "To request a new company laptop, fill out the 'IT Asset Request' form on the intranet and send it to asset.it@fintaazienda.com. Approval from your manager is required.",
   'role': 'assistant'}],
 [{'content': 'Who is the contact person for IT security training?',
   'role': 'user'},
  {'content': 'The contact person for IT security training is Matteo Lorusso from the IT Security department.',
   'role': 'assistant'}],
 [{'content': 'Where can I find the forms for travel expense reporting?',
   'role': 'user'},
  {'content': "The forms can be found in the 'Company Documents' section of the intranet, under 'Administration'.",
   'role': 'assistant'}]]

### Format the conversations column with the tags used to train Llama models

In [None]:
# Format the conversations column into a single string using the tokenizer's chat template
def format_conversations(example):
    # Apply the chat template to the list of messages
    # The tokenizer handles the list of dicts and outputs a single formatted string
    example["formatted_conversations"] = tokenizer.apply_chat_template(example["conversations"], tokenize=False, add_generation_prompt=False)
    return example

# Apply the formatting function to the dataset
dataset = dataset.map(format_conversations, num_proc=2)

Map (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]
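
To see where each tag in a formatted conversation comes from, here is a simplified re-implementation of the layout. It is only an illustration: the real string is produced by the tokenizer's Jinja chat template, and the system block (knowledge cutoff and date) is copied verbatim from the formatted example printed in this section rather than computed. The function names `render_turn` and `render_llama32` are hypothetical.

```python
# System block copied from the notebook's formatted example (not computed here).
SYSTEM_BLOCK = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"

def render_turn(role: str, content: str) -> str:
    """One turn: role header, blank line, content, end-of-turn tag."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def render_llama32(messages, add_generation_prompt: bool = False) -> str:
    """Concatenate the system turn and all message turns in Llama 3.2 tag layout."""
    text = "<|begin_of_text|>" + render_turn("system", SYSTEM_BLOCK)
    for message in messages:
        text += render_turn(message["role"], message["content"])
    if add_generation_prompt:
        # Open an assistant turn and leave it for the model to complete.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

conversation = [
    {"role": "user", "content": "How do I request a new company laptop?"},
    {"role": "assistant", "content": "To request a new company laptop, fill out the 'IT Asset Request' form on the intranet and send it to asset.it@fintaazienda.com. Approval from your manager is required."},
]
print(render_llama32(conversation))
```

For training data the assistant turn is already present, so `add_generation_prompt` stays `False`; at inference time it is set to `True` so that the model writes the assistant turn itself.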

In [None]:
dataset["conversations"][0]

[{'content': 'How do I request a new company laptop?', 'role': 'user'},
 {'content': "To request a new company laptop, fill out the 'IT Asset Request' form on the intranet and send it to asset.it@fintaazienda.com. Approval from your manager is required.",
  'role': 'assistant'}]

In [None]:
dataset["formatted_conversations"][0] # string with Llama 3.2 tag separators

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do I request a new company laptop?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTo request a new company laptop, fill out the 'IT Asset Request' form on the intranet and send it to asset.it@fintaazienda.com. Approval from your manager is required.<|eot_id|>"

### Train the model!
We will use HuggingFace **TRL's SFTTrainer** (Transformer Reinforcement Learning - Supervised Fine Tuning).
- TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT) and others.
- Supervised fine-tuning (SFT) is the most common step in post-training foundation models, and also one of the most effective.

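We train for 40 steps here. To train by epochs instead, set `num_train_epochs` (for example `num_train_epochs = 1`) and remove `max_steps`. On this 100-example dataset with an effective batch size of 8, 40 steps already amount to roughly three passes over the data.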
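
How much training 40 steps represents depends on the dataset size and the effective batch size. The sketch below does that arithmetic with the values used in this notebook (100 examples, batch size 2, 4 gradient-accumulation steps, one GPU). It is a back-of-the-envelope estimate and ignores how the trainer handles the last partial batch of an epoch.

```python
import math

num_examples = 100          # dataset size
per_device_batch_size = 2   # per_device_train_batch_size
grad_accum_steps = 4        # gradient_accumulation_steps
num_gpus = 1
max_steps = 40

# Examples consumed per optimizer step, and in total over the run.
effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
examples_seen = max_steps * effective_batch
passes = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"passes over the data: {passes:.1f} (spans {math.ceil(passes)} epochs)")
```

The run therefore reaches into a fourth epoch, which is consistent with the `Num Epochs = 4` line that Unsloth prints when training starts.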

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model_ft,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "formatted_conversations",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Setting this to True can make training up to 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 40, # for faster training
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    )
)

Unsloth: Tokenizing ["formatted_conversations"] (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.779 GB of memory reserved.


### Train on responses only

We train the model on the answers only, so that it learns our company's knowledge regardless of how a question is phrased.


The dataset consists of question-answer pairs related to company procedures (e.g., requesting a laptop, reporting GDPR violations).

- **Focus on Assistant Responses**: The dataset’s responses hold the key procedural knowledge (e.g., forms, emails, approvals). Computing the loss only on the responses makes the model learn the style and content of the answers; the questions are still read as context but are not themselves training targets.

- **Computational Efficiency**: The question tokens are masked out of the loss rather than removed, so sequence length and memory use stay about the same. The speed-up of up to 2x comes from Unsloth’s optimizations, not from the masking.
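
Conceptually, training on responses only means replacing the label of every non-assistant token with an ignore value, so that those positions contribute nothing to the loss. The sketch below shows the idea on a toy list of string tokens (real labels are integer token IDs). It assumes the common convention of `-100` as the ignore value, which is the default `ignore_index` of PyTorch's cross-entropy loss, and it is not Unsloth's actual implementation; `mask_non_responses` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss

def mask_non_responses(tokens, response_marker, instruction_marker):
    """Return labels that keep only the tokens following a response marker."""
    labels = []
    in_response = False
    for token in tokens:
        if token == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a training target
        elif token == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token if in_response else IGNORE_INDEX)
    return labels

tokens = ["<user>", "How", "do", "I", "request", "a", "laptop", "?",
          "<assistant>", "Fill", "out", "the", "form", ".", "<eot>"]
labels = mask_non_responses(tokens, "<assistant>", "<user>")
for token, label in zip(tokens, labels):
    print(f"{token:>12} -> {label}")
```

Only the tokens of the answer keep their labels; the question still sits in the input sequence, so the model conditions on it without being trained to reproduce it.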




In [None]:
from unsloth.chat_templates import train_on_responses_only

trainer_on_responses_only = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
trainer_on_responses_only_stats = trainer_on_responses_only.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100 | Num Epochs = 4 | Total steps = 40
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,237,063,680 (0.75% trained)


Step,Training Loss
1,3.9566
2,3.7941
3,3.7272
4,4.3573
5,3.1385
6,3.0557
7,3.1119
8,2.0338
9,2.1718
10,2.51


### Let's test the new fine-tuned model!

In [None]:
FastLanguageModel.for_inference(model_ft) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "How can I request a new company laptop?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model_ft.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 1024,
                   use_cache = True, temperature = 0.7, min_p = 0.1)

Fill out the 'Laptop Request' form on the intranet and send it to IT@fintaazienda.com.<|eot_id|>


### Export the fine-tuned model!

In [None]:
model_ft.save_pretrained_gguf("./model", tokenizer, quantization_method = "f16")

# Save to q4_k_m GGUF
# model_ft.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

# Save to 8bit Q8_0
# model_ft.save_pretrained_gguf("model", tokenizer,)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.01 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 25.70it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving ./model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving ./model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['f16'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at ./model into f16 GGUF format.
The output location will be /content/model/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: 
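
When choosing a `quantization_method`, a rough file-size estimate can help. The sketch below multiplies the merged model's weight count by an approximate bits-per-weight figure for each format. The weight count is the total from the training log minus the LoRA weights, which are merged into the base weights rather than stored separately. The bits-per-weight values for `q8_0` and `q4_k_m` are assumed approximations, and real GGUF files also contain metadata and may keep some tensors at higher precision.

```python
TOTAL_PARAMS = 3_237_063_680   # from the training log (base + LoRA weights)
LORA_PARAMS = 24_313_856       # from the training log
MERGED_PARAMS = TOTAL_PARAMS - LORA_PARAMS

# Approximate bits per weight (assumed values; q4_k_m mixes several block types).
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_k_m": 4.85}

def estimated_size_gb(method: str) -> float:
    """Estimated file size in decimal gigabytes, ignoring metadata."""
    return MERGED_PARAMS * BITS_PER_WEIGHT[method] / 8 / 1e9

for method in BITS_PER_WEIGHT:
    print(f"{method:>7}: ~{estimated_size_gb(method):.2f} GB")
```

The `f16` estimate of about 6.43 GB agrees with the size of the `model.safetensors` download shown when the model was loaded.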

In [None]:
import os
from google.colab import files

folder_path = 'model'
for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    # Download only the exported F16 GGUF file
    if os.path.isfile(file_path) and file_path.endswith("unsloth.F16.gguf"):
        files.download(file_path)

