This notebook is adapted from the Unsloth tutorial.

Link: [Ollama + Unsloth + Llama-3 + CSV finetuning](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install unsloth
# bitsandbytes is a lightweight Python wrapper around custom CUDA functions: 8-bit optimizers, LLM.int8() matrix multiplication, and 8-bit and 4-bit quantization.
!pip install bitsandbytes
# Get latest Unsloth
!pip install --upgrade --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# Install unsloth_zoo to get model
!pip install unsloth_zoo

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model_name = "unsloth/llama-3-8b-bnb-4bit" # Llama-3 8B, pre-quantized to 4-bit with bitsandbytes. More models: https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.14: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

### Deep Dive into `get_peft_model` of `FastLanguageModel`

The function `get_peft_model` in `FastLanguageModel` is used to apply Parameter Efficient Fine-Tuning (PEFT) techniques, such as LoRA (Low-Rank Adaptation), to a base model.  

- **Key Variables & Their Effects:**  
  - `r`: Defines the rank of LoRA adapters, affecting memory usage and fine-tuning efficiency. Higher values (e.g., 128) increase adaptability but require more VRAM.  
  - `target_modules`: Specifies which layers to apply LoRA, influencing which parts of the model learn new patterns.  
  - `lora_alpha`: A scaling factor; the LoRA update is scaled by `lora_alpha / r`, so higher values amplify LoRA updates and make tuning more aggressive.  
  - `lora_dropout`: Prevents overfitting; `0` is optimized for efficiency.  
  - `bias`: Controls bias handling; `"none"` optimizes performance.  
  - `use_gradient_checkpointing`: Reduces memory by recomputing activations; `"unsloth"` minimizes VRAM usage.  
  - `use_rslora`: Enables Rank Stabilized LoRA, improving convergence in some cases.  
  - `loftq_config`: Optional LoftQ configuration. LoftQ initializes the LoRA weights to reduce the error introduced by quantizing the base model; `None` disables it.  

Tweaking these parameters balances efficiency, performance, and adaptability when fine-tuning large models.
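To make the effect of `r` and `target_modules` concrete, the sketch below counts the LoRA weights added for the configuration used in this notebook and computes the update scale from `lora_alpha`. The layer shapes are assumptions taken from the public Llama-3-8B configuration (hidden size 4096, 8 key/value heads of dimension 128, MLP size 14336, 32 layers), and `lora_trainable_params` and `lora_scale` are illustrative helpers, not Unsloth functions.

```python
import math

# Assumed Llama-3-8B shapes (from the public model config, not from this notebook).
HIDDEN = 4096          # hidden_size
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP intermediate_size
NUM_LAYERS = 32

# (in_features, out_features) of each projection that LoRA is applied to.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_trainable_params(r, target_modules):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * sum(TARGET_SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS

def lora_scale(lora_alpha, r, use_rslora=False):
    """Standard LoRA scales the update by alpha / r; rsLoRA uses alpha / sqrt(r)."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

total = lora_trainable_params(r=16, target_modules=list(TARGET_SHAPES))
print(f"{total:,} trainable LoRA parameters")        # 41,943,040
print(lora_scale(lora_alpha=16, r=16))               # 1.0
print(lora_scale(lora_alpha=16, r=16, use_rslora=True))  # 4.0
```

Under these assumptions the count matches the `Trainable parameters = 41,943,040` figure that the trainer banner reports later in this notebook. Stored at 4 bytes per weight, that is roughly 168 MB, which is consistent with the size of `adapter_model.safetensors` in the upload log.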


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.14 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Data"></a>
### Data Prep

We'll now use the [Titanic dataset](https://www.kaggle.com/c/titanic), which is a CSV / Excel file with many columns. The goal is to predict whether a passenger survived or perished based on characteristics such as their age and the fare they paid.

We uploaded it to our [HF repo](https://huggingface.co/datasets/unsloth/datasets/raw/main/titanic.csv), but you can upload your own CSV by pressing the 📂 icon on the left and then the upload 🔼 button.

In [None]:
from datasets import load_dataset
dataset = load_dataset(
    "csv",
    data_files = "https://huggingface.co/datasets/unsloth/datasets/raw/main/titanic.csv",
    split = "train",
)
print(dataset.column_names)
print(dataset[0])

Downloading data:   0%|          | 0.00/61.2k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
{'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22.0, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': None, 'Embarked': 'S'}


One issue is this dataset has multiple columns. For `Ollama` and `llama.cpp` to function like a custom `ChatGPT` Chatbot, we must only have 2 columns - an `instruction` and an `output` column.

In [None]:
print(dataset.column_names)

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']


To solve this, we shall do the following:
* Merge all columns into 1 instruction prompt.
* Remember LLMs are text predictors, so we can customize the instruction to anything we like!
* Use the `to_sharegpt` function to do this column merging process!

<img src="https://raw.githubusercontent.com/unslothai/unsloth/nightly/images/Merge.png" height="100">

To merge multiple columns into 1, use `merged_prompt`.
* Enclose all columns in curly braces `{}`.
* Optional text must be enclosed in `[[]]`. For example, if the column "Pclass" is empty, the merging function will not show the text and will skip it. This is useful for datasets with missing values.
* You can select every column, or a few!
* Select the output or target / prediction column in `output_column_name`. For the Titanic dataset, this will be `Survived`.

For example, if we want to use the columns `Age` and `Fare`, we can do the following:



In [None]:
from unsloth import to_sharegpt
dataset_simple = to_sharegpt(
    dataset,
    merged_prompt = "[[Their age is {Age}.\n]][[They paid ${Fare} for the trip.\n]]",
    output_column_name = "Survived",
)

Merging columns:   0%|          | 0/891 [00:00<?, ? examples/s]

Converting to ShareGPT:   0%|          | 0/891 [00:00<?, ? examples/s]
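To illustrate how the `[[...]]` optional sections behave, here is a minimal pure-Python sketch of the merging rule described above. `render_merged_prompt` is a hypothetical helper written for this explanation; it is not Unsloth's implementation, and it assumes that missing values arrive as `None`.

```python
import re

# Matches either an optional [[...]] section or a bare {column} placeholder.
PLACEHOLDER = re.compile(r"\[\[(.*?)\]\]|\{(\w+)\}", flags=re.DOTALL)

def render_merged_prompt(template, row):
    """Fill a merged_prompt-style template from one row (illustrative only)."""
    def substitute(match):
        optional_section, required_column = match.group(1), match.group(2)
        if optional_section is None:
            # A bare {column} outside [[...]] is always filled in.
            return str(row[required_column])
        columns = re.findall(r"\{(\w+)\}", optional_section)
        # Drop the whole optional section if any column it uses is missing.
        if any(row.get(column) is None for column in columns):
            return ""
        return optional_section.format(**row)
    return PLACEHOLDER.sub(substitute, template)

template = "[[Their age is {Age}.\n]][[They paid ${Fare} for the trip.\n]]"
print(render_merged_prompt(template, {"Age": 22.0, "Fare": 7.25}))
print(render_merged_prompt(template, {"Age": None, "Fare": 7.8958}))
```

The second call prints only the fare sentence, because the `Age` value is missing and its optional section is skipped.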

We shall now provide a complex example using nearly all the columns in the dataset as shown below!

We also provide a setting called `conversation_extension`. This selects a few random rows in the dataset and combines them into 1 conversation. This allows the custom finetune to work not just on 1 user input but on many, making it a true chatbot like ChatGPT!

In [None]:
from unsloth import to_sharegpt
dataset = to_sharegpt(
    dataset,
    merged_prompt = \
        "[[The passenger embarked from {Embarked}.]]"\
        "[[\nThey are {Sex}.]]"\
        "[[\nThey have {Parch} parents and children.]]"\
        "[[\nThey have {SibSp} siblings and spouses.]]"\
        "[[\nTheir passenger class is {Pclass}.]]"\
        "[[\nTheir age is {Age}.]]"\
        "[[\nThey paid ${Fare} for the trip.]]",
    conversation_extension = 5, # Randomly combines conversations into 1! Good for long convos
    output_column_name = "Survived",
)

Flattening the indices:   0%|          | 0/891 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/891 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/891 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/891 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/891 [00:00<?, ? examples/s]

Extending conversations:   0%|          | 0/891 [00:00<?, ? examples/s]
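The idea behind `conversation_extension` can be sketched in plain Python as follows. This is only an illustration of the concept under stated assumptions: `extend_conversations` is a hypothetical helper, the sampling strategy is a guess, and Unsloth's real implementation may pick and order rows differently.

```python
import random

def extend_conversations(single_turns, extension, seed=3407):
    """Append (extension - 1) randomly chosen single-turn chats to each chat.

    Illustrative only: the sampling strategy here is an assumption.
    """
    rng = random.Random(seed)
    extended = []
    for conversation in single_turns:
        merged = list(conversation)
        for other in rng.sample(single_turns, extension - 1):
            merged.extend(other)
        extended.append(merged)
    return extended

# Ten toy single-turn conversations in ShareGPT style.
single_turns = [
    [{"from": "human", "value": f"question {i}"}, {"from": "gpt", "value": str(i % 2)}]
    for i in range(10)
]
extended = extend_conversations(single_turns, extension=5)
print(len(extended), len(extended[0]))  # 10 conversations, 10 messages each
```

The number of examples stays the same, but each one now holds 5 human/gpt exchanges, which matches the shape of the printed example that follows.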

Let's print out what the dataset looks like now:

In [None]:
from pprint import pprint
pprint(dataset[0])

{'conversations': [{'from': 'human',
                    'value': 'Their age is 22.0.\n'
                             'They paid $7.25 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'Their age is 52.0.\n'
                             'They paid $79.65 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'Their age is 9.0.\n'
                             'They paid $31.275 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'They paid $7.8958 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'},
                   {'from': 'human',
                    'value': 'Their age is 24.0.\n'
                             'They paid $13.0 for the trip.\n'},
                   {'from': 'gpt', 'value': '0'}]}


Finally, use `standardize_sharegpt`! Datasets sometimes use different tags, such as `human` for the user and `gpt` for the assistant. This function converts them all to the Hugging Face (OpenAI-style) `user`, `assistant` and `system` tags, which is what we require.

In [None]:
from unsloth import standardize_sharegpt
dataset = standardize_sharegpt(dataset)

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/891 [00:00<?, ? examples/s]
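As a rough picture of what this standardization does to one conversation, consider the sketch below. `standardize_conversation` and `ROLE_ALIASES` are hypothetical names for illustration, and the `role` / `content` output keys are assumed from the message format used in the inference cells of this notebook.

```python
# Assumed mapping from ShareGPT-style tags to the roles we require.
ROLE_ALIASES = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "system": "system",
}

def standardize_conversation(conversation):
    """Convert {'from', 'value'} messages into {'role', 'content'} messages."""
    return [
        {"role": ROLE_ALIASES[message["from"]], "content": message["value"]}
        for message in conversation
    ]

sharegpt_style = [
    {"from": "human", "value": "Their age is 22.0.\nThey paid $7.25 for the trip.\n"},
    {"from": "gpt", "value": "0"},
]
print(standardize_conversation(sharegpt_style))
```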

### Customizable Chat Templates

You also need to specify a chat template. Previously, you could use the Alpaca format as shown below.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

The issue is the Alpaca format has 3 fields, whilst OpenAI style chatbots must only use 2 fields (instruction and response). That's why we used the `to_sharegpt` function to merge these columns into 1.

* Now, you have to use `{INPUT}` for the instruction and `{OUTPUT}` for the response.

In [None]:
chat_template = """Below describes some details about some passengers who went on the Titanic.
Predict whether they survived or perished based on their characteristics.
Output 1 if they survived, and 0 if they died.
>>> Passenger Details:
{INPUT}
>>> Did they survive?
{OUTPUT}"""

from unsloth import apply_chat_template
dataset = apply_chat_template(
    dataset,
    tokenizer = tokenizer,
    chat_template = chat_template,
    # default_system_message = "You are a helpful assistant", << [OPTIONAL]
)

Unsloth: We automatically added an EOS token to stop endless generations.


Map:   0%|          | 0/891 [00:00<?, ? examples/s]
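For intuition, this is roughly what one single-turn training example looks like after the `{INPUT}` and `{OUTPUT}` fields are filled in. The sketch is a simplification: `render_single_turn` is a hypothetical helper, multi-turn handling is ignored, and the EOS string `<|end_of_text|>` is assumed from the generation output shown later in this notebook.

```python
chat_template = """Below describes some details about some passengers who went on the Titanic.
Predict whether they survived or perished based on their characteristics.
Output 1 if they survived, and 0 if they died.
>>> Passenger Details:
{INPUT}
>>> Did they survive?
{OUTPUT}"""

def render_single_turn(template, user_input, assistant_output, eos_token):
    """Fill one instruction/response pair and append EOS so generation can stop."""
    text = template.replace("{INPUT}", user_input).replace("{OUTPUT}", assistant_output)
    return text + eos_token

example = render_single_turn(
    chat_template,
    user_input="Their age is 22.0.\nThey paid $7.25 for the trip.",
    assistant_output="0",
    eos_token="<|end_of_text|>",  # assumed EOS token of the base Llama-3 tokenizer
)
print(example)
```

Using `str.replace` rather than `str.format` keeps any other curly braces in the template untouched.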

We also allow you to use an optional `{SYSTEM}` field. This is useful for Ollama when you want to use a custom system prompt (also like in ChatGPT).

You can also omit the `{SYSTEM}` field and just put plain text.

```python
chat_template = """{SYSTEM}
USER: {INPUT}
ASSISTANT: {OUTPUT}"""
```

Use the template below if you want the Llama-3 prompt format. You must use the `instruct` model, not the `base` model, with it!
```python
chat_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>"""
```

For the ChatML format:
```python
chat_template = """<|im_start|>system
{SYSTEM}<|im_end|>
<|im_start|>user
{INPUT}<|im_end|>
<|im_start|>assistant
{OUTPUT}<|im_end|>"""
```

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and `max_steps=None`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/891 [00:00<?, ? examples/s]
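The training arguments above imply some simple arithmetic that is worth checking before a long run. The sketch below works it out for this notebook's values. The single-GPU setting comes from the logs, the learning-rate function follows the usual linear schedule with warmup (an assumption about what `lr_scheduler_type = "linear"` resolves to), and the steps-per-epoch figure is approximate because the trainer may round the last partial batch differently.

```python
import math

num_examples = 891
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
warmup_steps = 5
learning_rate = 2e-4

# Examples consumed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Examples seen in the short 60-step run, and the share of one epoch that covers.
examples_seen = max_steps * effective_batch_size
epoch_fraction = examples_seen / num_examples
# Approximate optimizer steps needed for one full pass over the data.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

def linear_lr(step):
    """Linear warmup from 0, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    return learning_rate * (max_steps - step) / (max_steps - warmup_steps)

print(effective_batch_size)      # 8
print(examples_seen)             # 480
print(round(epoch_fraction, 2))  # 0.54
print(steps_per_epoch)           # 112
print(linear_lr(30))
```

So the 60-step run covers a little over half of the dataset, and a full epoch would take roughly twice as many steps.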

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
5.496 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 891 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: bachduong7103 (dngback) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.7795
2,1.7382
3,1.7284
4,1.506
5,1.3061
6,1.1192
7,0.9461
8,0.7801
9,0.6632
10,0.572


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

462.1772 seconds used for training.
7.7 minutes used for training.
Peak reserved memory = 6.316 GB.
Peak reserved memory for training = 0.82 GB.
Peak reserved memory % of max memory = 42.846 %.
Peak reserved memory for training % of max memory = 5.563 %.


<a name="Inference"></a>
### Inference
Let's run the model! Unsloth makes inference natively 2x faster as well! You should use prompts similar to the ones you finetuned on, otherwise you might get bad results!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                    # Change below!
    {"role": "user", "content": 'The passenger embarked from S.\n'\
                                'They are male.\n'\
                                'They have 1 siblings and spouses.\n'\
                                'Their passenger class is 3.\n'\
                                'Their age is 22.0.\n'\
                                'They paid $7.25 for the trip.'},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


0<|end_of_text|>


Let's try another example:

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [                    # Change below!
    {"role": "user", "content": 'Their passenger class is 1.\n'\
                                'Their age is 22.0.\n'\
                                'They paid $107.25 for the trip.'},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

0<|end_of_text|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
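For the local option, a minimal sketch could look like the following. `save_lora_adapters` is a hypothetical wrapper, and the `"lora_model"` directory name is taken from the reload cell further down; the only library calls it relies on are the `save_pretrained` methods of the model and tokenizer.

```python
def save_lora_adapters(model, tokenizer, output_dir="lora_model"):
    """Save only the LoRA adapters and the tokenizer files to a local folder."""
    model.save_pretrained(output_dir)      # adapter weights + adapter config
    tokenizer.save_pretrained(output_dir)  # tokenizer files incl. chat template
    return output_dir

# Usage in this notebook (assumes `model` and `tokenizer` from the cells above):
# save_lora_adapters(model, tokenizer)
```

Saving the tokenizer next to the adapters keeps the customized chat template together with the weights, so the folder can be passed back to `from_pretrained` as `model_name`.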

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
The token `WriteHuggingfaceTokens` has been saved to /root/.cache/huggingface/stored_tokens
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have t

In [None]:
model.push_to_hub("DngBack/unsloth_guild_csvtuning", token = "...") # Online saving
tokenizer.push_to_hub("DngBack/unsloth_guild_csvtuning", token = "...") # Online saving

README.md:   0%|          | 0.00/574 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.25k [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/DngBack/unsloth_guild_csvtuning


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
pass

messages = [                    # Change below!
    {"role": "user", "content": 'Their passenger class is 3.\n'\
                                'Their age is 22.0.\n'\
                                'They paid $107.25 for the trip.'},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)

0<|end_of_text|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Strongly NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")