# 🐢 Unsloth — Fast & Memory-Efficient LLM Fine-Tuning

[Unsloth](https://github.com/unslothai/unsloth) is an open-source library that makes fine-tuning large language models (LLMs) **much faster and lighter** on your GPU.

---

## 🚀 Key Features
- ⚡ 2–5× faster fine-tuning than standard methods  
- 💾 Lower VRAM usage with 4-bit & 8-bit quantization  
- 🧠 Supports LoRA / QLoRA for parameter-efficient fine-tuning  
- 🖥️ Works even on consumer GPUs like RTX 3060 / 4090  
- 🤝 Integrates with Hugging Face Transformers & PEFT

---

## 🧪 Typical Use Cases
- Fine-tuning LLMs (e.g. Llama 3, Mistral, Falcon) on custom datasets  
- Building domain-specific chatbots, assistants, or code models  
- Running experiments on smaller GPUs

---

## 📦 Unsloth Installation


In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

## ⚡ Load a 4-bit Quantized Model with 🐢 Unsloth

The following code loads a 4-bit quantized large language model (LLM) using [Unsloth](https://github.com/unslothai/unsloth).  
This makes fine-tuning **much faster and more memory efficient**, even on consumer GPUs.

We'll use **Meta-Llama-3.1-8B-Instruct** as an example.
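
To see why 4-bit loading matters on a small GPU, here is a rough back-of-envelope sketch of the memory needed for the weights alone. The parameter count is derived from the training log later in this notebook; the helper name `weight_memory_gib` is made up for illustration. The estimate ignores quantization constants, activations, and any tensors kept in higher precision, so real usage will be higher (the 4-bit checkpoint downloaded below is 5.96 GB).

```python
# Back-of-envelope weight memory at different precisions.
# BASE_PARAMS comes from the training log later in this notebook:
# 8,072,204,288 total parameters minus 41,943,040 LoRA parameters.
BASE_PARAMS = 8_072_204_288 - 41_943_040
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB (hypothetical helper)."""
    return num_params * bits_per_param / 8 / GIB


fp16_gib = weight_memory_gib(BASE_PARAMS, 16)
int4_gib = weight_memory_gib(BASE_PARAMS, 4)
T4_GIB = 14.741  # "Max memory" reported for the Tesla T4 in this run

print(f"16-bit weights: {fp16_gib:.2f} GiB (fits on T4: {fp16_gib < T4_GIB})")
print(f" 4-bit weights: {int4_gib:.2f} GiB (fits on T4: {int4_gib < T4_GIB})")
```

Under these assumptions the 16-bit weights alone already exceed the T4's memory, while the 4-bit weights leave room for activations, adapters, and optimizer state.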


In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.4: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

## 🧩 What is LoRA (Low-Rank Adaptation)?

**LoRA** is a technique that makes fine-tuning **large language models (LLMs)** more **efficient** by updating only a **small number of additional parameters**, instead of all the model weights.

---

### ⚙️ How It Works
- Normally, fine-tuning updates **billions of parameters**, which requires a lot of GPU memory and time.
- **LoRA freezes the original model weights** and adds **small trainable "adapter" layers** (low-rank matrices) inside certain layers (like attention layers).
- During training:
  - Only these small adapter layers are updated.
  - The original pretrained weights stay the same.
- At inference time, the **LoRA adapter weights can be merged** into the base model, or kept as a separate adapter that is loaded on top of it.
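
The steps above can be sketched with tiny matrices in plain Python. This is a toy illustration, not Unsloth's or PEFT's implementation: the helper functions and the 3 x 3 example values are made up, and real adapters start with `B` set to zero so that training begins from the unmodified base model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Frozen base weight W (3 x 3) and a rank-1 adapter.
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
A = [[1, 2, 0]]        # trainable, shape (r, in_features)
B = [[1], [0], [2]]    # trainable, shape (out_features, r)
r, lora_alpha = 1, 2
scaling = lora_alpha / r

x = [[1], [1], [1]]    # one input column vector

# Training-time forward pass: frozen path plus the scaled low-rank path.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Optional merge for inference: fold B @ A into W once.
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out)
print(merged_out)
```

Both paths give the same output, which is why merging is optional: it trades the flexibility of swappable adapters for a single weight matrix.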

---

### 📈 Why Use LoRA
- 🧠 **Memory-efficient** — Needs much less GPU memory
- ⚡ **Faster training** — Updates far fewer parameters
- 💾 **Lightweight** — Only saves a small adapter file (a few MBs)
- 🔁 **Composable** — You can apply multiple LoRA adapters to the same base model

---

### 💡 Typical Use Cases
- Fine-tuning LLMs on domain-specific data (medical, legal, finance, etc.)
- Creating personalized chatbots or assistants
- Training multiple versions of a model without storing full copies


We now add LoRA adapters, so only a small fraction of all parameters needs to be updated (about 0.5% in this run, per the training log further down).

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.9.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
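
As a sanity check on what `r = 16` and the seven `target_modules` cost, the sketch below counts the adapter parameters by hand. The layer shapes are assumptions taken from the public Llama-3.1-8B configuration, not from this notebook, and `lora_params` is a hypothetical helper. Each adapted projection adds `r * (in_features + out_features)` parameters.

```python
# Assumed Llama-3.1-8B shapes (public model config, not shown in this notebook).
HIDDEN = 4096         # hidden size
KV_DIM = 1024         # 8 key/value heads x 128 head dim
INTERMEDIATE = 14336  # MLP width
NUM_LAYERS = 32
R = 16                # LoRA rank used above

# (in_features, out_features) of every targeted projection in one layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
share = 100 * total / 8_072_204_288  # total parameter count from the training log

print(f"{per_layer:,} per layer, {total:,} in total ({share:.2f}% of the model)")
```

Under these assumptions the total matches the `Trainable parameters = 41,943,040` line that the trainer prints later.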


<a name="Data"></a>
### Data Preparation
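We now use the `Llama-3.1` format for conversation-style finetunes. The Unsloth example this walkthrough follows uses [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style; the cells below instead load `griffin/election-synthetic`, which has `question` and `answer` columns. Either way, each example is converted to Hugging Face's normal multi-turn format `("role", "content")` instead of ShareGPT's `("from", "value")`. Llama-3 renders multi-turn conversations like below: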

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
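
The rendering above can be reproduced with a few lines of plain Python. This is a simplified sketch, not the tokenizer's actual Jinja template: `sharegpt_to_messages` and `render_llama3` are hypothetical helpers, the `human`/`gpt` role names follow the usual ShareGPT convention, and the real Llama 3.1 template additionally injects a system header.

```python
# Usual ShareGPT role names mapped to Hugging Face role names.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(turns):
    """Convert ("from", "value") turns into ("role", "content") messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]


def render_llama3(messages):
    """Render messages in the simplified Llama-3 layout shown above."""
    text = "<|begin_of_text|>"
    for m in messages:
        text += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    return text


sharegpt = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there! How are you?"},
    {"from": "human", "value": "I'm great thanks!"},
]
messages = sharegpt_to_messages(sharegpt)
rendered = render_llama3(messages)
print(rendered)
```

In practice you should rely on `tokenizer.apply_chat_template`, which knows the exact template for the model; the sketch only shows where the special tokens go.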

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [5]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    texts = []
    convos = []  # Create a list to store conversations

    for question, answer in zip(examples['question'], examples['answer']):
        # Create a conversation-like structure
        convo = [
            {'role': 'user', 'content': question},
            {'role': 'assistant', 'content': answer}
        ]
        convos.append(convo)  # Add convo to the list

        # Apply the chat template
        formatted_text = tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        texts.append(formatted_text)

    return {'text': texts, 'conversations': convos}  # Return the list of convos
pass



In [6]:
from datasets import load_dataset
dataset = load_dataset("griffin/election-synthetic", split = "train")
print(dataset[:5])

{'question': ['Who did Kamala Harris select to be her vice presidential candidate in her campaign against Donald Trump?', 'True or False?\n\nAll governors are potential vice presidential candidates.', 'Who is leading the Democratic ticket in the 2024 presidential race?', 'If Kamala Harris and Tim Walz are standing 5 meters apart and a photographer wants to take their picture, what is the minimum focal length needed for a full-body shot if the camera sensor is 36mm wide and they occupy 80% of the frame width?', 'For which White House contender is the current leader of the Gopher State serving as the prospective vice president?'], 'answer': ["Tim Walz, Minnesota's Governor", 'False', 'Democratic presidential nominee', '4000mm', 'Kamala Harris'], 'meta': ['{"entity": "Tim Walz, Governor of Minnesota", "hops": 0}', '{"claim": "All governors are potential vice presidential candidates."}', '{"entity": "Presidential nominee", "hops": 0}', '{"reasoning_type": "physical"}', '{"hops": 4, "multi_

## 📝 Formatting the Dataset

Before fine-tuning, we need to **convert our raw dataset into text prompts** that the model can understand.

We define a custom function `formatting_prompts_func` which takes each example and formats it into the desired **prompt → response** structure.

Then we apply it to every example in the dataset using:

```python
dataset = dataset.map(formatting_prompts_func, batched=True)
```

In [8]:
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/849 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [9]:
dataset[5]["conversations"]

[{'content': 'Should Tim Walz be morally obligated to resign as governor of Minnesota upon accepting the vice presidential nomination?',
  'role': 'user'},
 {'content': "While there's no legal requirement, there is a strong moral argument that Tim Walz should resign as governor of Minnesota upon accepting the vice presidential nomination. This is because:\n\n1. Full commitment: The roles of governor and vice presidential candidate both demand full-time attention. Resigning would allow Walz to fully commit to the campaign without shortchanging his gubernatorial duties.\n\n2. Avoiding conflicts of interest: As a sitting governor, Walz might face situations where his decisions could be perceived as motivated by campaign interests rather than the best interests of Minnesota.\n\n3. Fair representation for Minnesota: The state deserves a full-time governor focused solely on state issues, not divided attention.\n\n4. Transparency and integrity: Resigning would demonstrate a commitment to ethi

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [10]:
dataset[5]["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nShould Tim Walz be morally obligated to resign as governor of Minnesota upon accepting the vice presidential nomination?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWhile there's no legal requirement, there is a strong moral argument that Tim Walz should resign as governor of Minnesota upon accepting the vice presidential nomination. This is because:\n\n1. Full commitment: The roles of governor and vice presidential candidate both demand full-time attention. Resigning would allow Walz to fully commit to the campaign without shortchanging his gubernatorial duties.\n\n2. Avoiding conflicts of interest: As a sitting governor, Walz might face situations where his decisions could be perceived as motivated by campaign interests rather than the best interests of Minnesota.\n\n3. Fair representatio

## 🚀 Setting Up the Trainer

Now that our dataset is ready, we'll use the **SFTTrainer** from the `trl` library  
to fine-tune our model with the LoRA adapters attached.

Key parts:

- **SFTTrainer** — A high-level trainer for supervised fine-tuning (SFT) on text datasets.
- **TrainingArguments** — Controls all training hyperparameters like learning rate, batch size, logging, etc.
- **DataCollatorForSeq2Seq** — Prepares batches of tokenized data for training.

Important settings used:
- `per_device_train_batch_size = 2` — Small batch size (good for limited VRAM)
- `gradient_accumulation_steps = 4` — Accumulate gradients to simulate a larger batch
- `max_steps = 60` — Total number of training steps
- `fp16` / `bf16` — Mixed precision for faster training
- `optim = "adamw_8bit"` — 8-bit optimizer for memory efficiency
- `output_dir = "outputs"` — Folder where checkpoints are saved
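
To make the batch-size and step settings concrete, here is the arithmetic they imply for this run. It is a plain-Python sketch; the dataset size of 849 is taken from the progress bars in this notebook, and the steps-per-epoch figure is an approximation rather than the exact number the trainer would compute.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 849  # dataset size shown in the Map progress bars

# Sequences contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Sequences consumed over the whole (shortened) run.
examples_seen = effective_batch * max_steps

# Roughly how many optimizer steps one full pass over the data would take.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"fraction of one epoch: {examples_seen / num_examples:.2f}")
print(f"approx. steps for a full epoch: {steps_per_epoch}")
```

So the 60-step run covers a bit more than half of the dataset once; a full epoch would need roughly 107 steps at this batch size.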


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [11]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/849 [00:00<?, ? examples/s]
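
The `warmup_steps = 5`, `learning_rate = 2e-4`, and `lr_scheduler_type = "linear"` settings in the cell above describe a learning rate that ramps up linearly and then decays linearly to zero. The function below is a minimal sketch of that shape, assuming the usual linear-warmup, linear-decay formula; `linear_schedule_lr` is a hypothetical helper, not the trainer's own code.

```python
def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


BASE_LR, WARMUP, TOTAL = 2e-4, 5, 60
for step in (0, 1, 5, 30, 60):
    print(step, linear_schedule_lr(step, BASE_LR, WARMUP, TOTAL))
```

The peak learning rate is only reached at the end of warmup, which is one reason the first few loss values in a run tend to be noisy.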

## 🎯 Train Only on Assistant Outputs

We will use **Unsloth's `train_on_responses_only` function** so that:

- The model **only learns from the assistant's responses (completions)**  
- It will **ignore the loss on the user inputs (prompts)**

This is useful for **instruction tuning**, where we want the model to generate good answers,  
not to memorize the prompts themselves.
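
The idea can be sketched on plain lists of token ids: every position outside an assistant turn gets the label `-100`, which the loss function ignores. This is an illustrative simplification with made-up token ids and a hypothetical `mask_prompt_tokens` helper; it is not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_prompt_tokens(token_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(token_ids)
    in_response = False
    i = 0
    while i < len(token_ids):
        if token_ids[i:i + len(response_ids)] == response_ids:
            in_response = True            # assistant turn starts after the marker
            i += len(response_ids)
            continue
        if token_ids[i:i + len(instruction_ids)] == instruction_ids:
            in_response = False           # user turn: stop learning
            i += len(instruction_ids)
            continue
        if in_response:
            labels[i] = token_ids[i]
        i += 1
    return labels


# Made-up ids: 1 = begin-of-text, 2 = end-of-turn,
# [10, 11] = user header, [20, 21] = assistant header.
USER, ASSISTANT = [10, 11], [20, 21]
ids = [1, 10, 11, 5, 6, 2, 20, 21, 7, 8, 2, 10, 11, 9, 2, 20, 21, 3, 2]
labels = mask_prompt_tokens(ids, USER, ASSISTANT)
print(labels)
```

Only the assistant's tokens (and the end-of-turn token that closes each reply) keep their labels, so the model is never penalized for failing to predict the prompt.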


In [12]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/849 [00:00<?, ? examples/s]

We verify masking is actually done:

In [13]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nShould Tim Walz be morally obligated to resign as governor of Minnesota upon accepting the vice presidential nomination?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nWhile there's no legal requirement, there is a strong moral argument that Tim Walz should resign as governor of Minnesota upon accepting the vice presidential nomination. This is because:\n\n1. Full commitment: The roles of governor and vice presidential candidate both demand full-time attention. Resigning would allow Walz to fully commit to the campaign without shortchanging his gubernatorial duties.\n\n2. Avoiding conflicts of interest: As a sitting governor, Walz might face situations where his decisions could be perceived as motivated by campaign interests rather than the best interests of Minnesota.\n\n3. F

In [14]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

"                                                        While there's no legal requirement, there is a strong moral argument that Tim Walz should resign as governor of Minnesota upon accepting the vice presidential nomination. This is because:\n\n1. Full commitment: The roles of governor and vice presidential candidate both demand full-time attention. Resigning would allow Walz to fully commit to the campaign without shortchanging his gubernatorial duties.\n\n2. Avoiding conflicts of interest: As a sitting governor, Walz might face situations where his decisions could be perceived as motivated by campaign interests rather than the best interests of Minnesota.\n\n3. Fair representation for Minnesota: The state deserves a full-time governor focused solely on state issues, not divided attention.\n\n4. Transparency and integrity: Resigning would demonstrate a commitment to ethical governance and avoid the appearance of using one office to campaign for another.\n\n5. Precedent: Many politi

We can see the System and Instruction prompts are successfully masked!

Show current memory stats

In [15]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.881 GB of memory reserved.


### Model Training

In [16]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 849 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 5.1763 |
| 2 | 3.5861 |
| 3 | 3.1942 |
| 4 | 3.8887 |
| 5 | 3.1853 |
| 6 | 1.8817 |
| 7 | 2.3681 |
| 8 | 2.2149 |
| 9 | 1.5772 |
| 10 | 1.217 |


Unsloth: Will smartly offload gradients to save VRAM!


In [17]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

267.9717 seconds used for training.
4.47 minutes used for training.
Peak reserved memory = 6.881 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 46.679 %.
Peak reserved memory for training % of max memory = 0.0 %.
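
The figures above follow directly from the values printed earlier, as this small sketch shows. The numbers are copied from this run's output. Because `torch.cuda.max_memory_reserved()` reports a peak, a difference of `0.0` most likely means that training never exceeded the peak already reached before `trainer.train()` was called, not that training used no memory.

```python
# Values copied from the outputs of this run.
start_gpu_memory = 6.881   # GB reserved before training (peak so far)
used_memory = 6.881        # GB peak reserved after training
max_memory = 14.741        # GB total on the Tesla T4
train_runtime = 267.9717   # seconds

used_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
minutes = round(train_runtime / 60, 2)

print(f"extra peak memory during training: {used_for_training} GB")
print(f"peak share of GPU memory: {used_percentage} %")
print(f"training time: {minutes} minutes")
```

To measure training on its own, one option is to call `torch.cuda.reset_peak_memory_stats()` just before training so that the peak counter starts fresh.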


## 💬 Running Inference with the Fine-Tuned Model

Now we'll:

1. Apply a **chat template** to structure our messages (user/assistant roles).
2. Enable **fast inference mode** using `FastLanguageModel.for_inference(model)`.
3. Generate model responses with `model.generate()`.

Key points:
- `get_chat_template(...)` applies a predefined conversation format (here `"llama-3.1"`).
- `apply_chat_template(...)` formats and tokenizes the messages.
- `add_generation_prompt=True` ensures the model knows it should continue the conversation.
- `temperature` controls randomness of outputs (higher = more creative).
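- `min_p` drops tokens whose probability falls below `min_p` times that of the most likely token, which keeps high-temperature sampling from picking very unlikely tokens.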

This will produce the model’s response to the user prompt.
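
To illustrate how `temperature` and `min_p` interact, here is a small pure-Python sketch on four made-up logits. It assumes temperature is applied before the `min_p` filter; `softmax` and `min_p_filter` are hypothetical helpers, not the library's logits processors.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * (top probability), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


logits = [2.0, 1.0, 0.0, -3.0]
sharp = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=1.5)     # higher temperature flattens the distribution
filtered = min_p_filter(flat, min_p=0.1)    # the unlikely last token is removed

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print([round(p, 3) for p in filtered])
```

A high temperature such as 1.5 spreads probability onto unlikely tokens; the `min_p` cut-off removes the tail again, which is why the two settings are often used together.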


In [18]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "who is partnering with Kamala Harris on the Democratic ticket In the 2024 presidential race?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwho is partnering with Kamala Harris on the Democratic ticket In the 2024 presidential race?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nTim Walz<|eot_id|>']

We can also use a `TextStreamer` to stream the output, so you can watch the tokens appear as they are generated instead of waiting for the full response!

In [19]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "who is partnering with Kamala Harris on the Democratic ticket In the 2024 presidential race?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Tim Walz<|eot_id|>


## 💾 Saving the Fine-Tuned Model

After training, we can save the model and tokenizer for later use:

### Local Saving
```python
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
```

<a name="Save"></a>
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [20]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')
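
Because only the adapters are saved, the checkpoint is far smaller than the full model. The sketch below estimates the sizes from the parameter counts in the training log. The storage precision of the adapter file is an assumption (both 32-bit and 16-bit are shown), and file metadata is ignored.

```python
LORA_PARAMS = 41_943_040                    # "Trainable parameters" in the training log
BASE_PARAMS = 8_072_204_288 - LORA_PARAMS   # remaining (frozen) parameters
MIB, GIB = 1024 ** 2, 1024 ** 3

# Adapter size if stored in 32-bit or 16-bit precision (assumed, not measured).
adapter_fp32_mib = LORA_PARAMS * 4 / MIB
adapter_fp16_mib = LORA_PARAMS * 2 / MIB

# Size of the full base model if saved in 16-bit precision.
full_16bit_gib = BASE_PARAMS * 2 / GIB

print(f"adapters, 32-bit: {adapter_fp32_mib:.0f} MiB")
print(f"adapters, 16-bit: {adapter_fp16_mib:.0f} MiB")
print(f"full model, 16-bit: {full_16bit_gib:.2f} GiB")
```

This is why several task-specific adapters can be kept for one base model without storing a full copy for each.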

## 🔄 Loading and Generating with a Trained LoRA Model

This cell demonstrates:

1. **Loading a fine-tuned LoRA model** from local storage (`lora_model`)  
   - Uses `FastLanguageModel.from_pretrained(...)`
   - Supports 4-bit quantization and native fast inference.

2. **Preparing a chat message**:
   - The user prompt is structured using `apply_chat_template(...)`.
   - `add_generation_prompt=True` signals the model to generate a completion.

3. **Streaming model output**:
   - `TextStreamer` from Transformers streams tokens to the console in real-time.
   - `max_new_tokens=128` limits the generation length.
   - `temperature` and `min_p` control randomness and diversity.

This setup allows interactive usage of your fine-tuned model without waiting for the entire output to finish.

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [21]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The Eiffel Tower<|eot_id|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have Unsloth installed. It can be hopelessly slow, since 4bit model downloading is not supported, and Unsloth's inference is 2x faster.

In [22]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

## 💾 Saving Merged or LoRA-Only Models

After training, you can save your fine-tuned model in **different formats**:

### 1️⃣ Merge LoRA into 16-bit model
```python
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
# Optional: push to Hugging Face Hub
# model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_16bit", token="")
```


In [23]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

## 💾 Saving Models in GGUF Format


### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are allowed too. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
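
To give a feel for what an 8-bit block format stores, here is a simplified sketch of block-wise quantization in the spirit of `q8_0`: each block of 32 weights keeps one scale plus 32 signed 8-bit integers. The helper names and the sample values are made up, and the real `llama.cpp` kernels differ in detail (for example, the scale is stored as a 16-bit float).

```python
BLOCK_SIZE = 32


def quantize_block(values):
    """Map one block of floats to (scale, list of ints in [-127, 127])."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


# One made-up block of 32 weights in roughly [-1, 1].
weights = [((i * 37) % 64 - 32) / 32 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage cost: 32 one-byte integers plus one 16-bit scale per block.
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE

print(f"scale = {scale:.6f}, max round-trip error = {max_error:.6f}")
print(f"bits per weight = {bits_per_weight}")
```

The round-trip error is bounded by half a quantization step, and the per-block scale adds about half a bit per weight on top of the 8 bits for the integers.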

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [24]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**