To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News

**NEW** Unsloth now supports training the new **gpt-oss** model from OpenAI! You can start finetuning gpt-oss for free with our **[Colab notebook](https://x.com/UnslothAI/status/1953896997867729075)**!

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N Guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quants, which outperform other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.5: Fast Llama patching. Transformers: 4.55.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.8.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
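
To get a feel for why LoRA trains so few parameters, here is a minimal sketch of the arithmetic for a single linear layer. It assumes the usual LoRA factorization, where a frozen weight of shape `d_out x d_in` gets two small trainable matrices of rank `r`. The layer shape and the helper name `lora_param_count` are hypothetical and chosen only for illustration; the overall fraction for a real model depends on `r` and on which `target_modules` are adapted.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    A has shape (r, d_in) and B has shape (d_out, r), so the adapter
    adds r * (d_in + d_out) parameters on top of the frozen weight.
    """
    return r * (d_in + d_out)


# Hypothetical layer shape, used only to illustrate the ratio.
d_in, d_out, r = 4096, 4096, 16

full = d_in * d_out                      # parameters in the frozen weight
added = lora_param_count(d_in, d_out, r) # parameters LoRA actually trains

print(f"frozen weight: {full:,} params")
print(f"LoRA adapter:  {added:,} params ({100 * added / full:.2f}% of the layer)")
```

Raising `r` grows the adapter linearly, which is why larger ranks cost more memory.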


**Upload Data**

In [4]:
# ===== STEP 1: UPLOAD YOUR DATASET.JSONL FILE =====

from google.colab import files
import json
from datasets import Dataset
import random

# Upload your dataset.jsonl file
print("Upload your dataset.jsonl file:")
uploaded = files.upload()

# The file should be named "dataset.jsonl"
filename = "dataset.jsonl"
print(f"Using file: {filename}")


Upload your dataset.jsonl file:


Saving dataset.jsonl to dataset.jsonl
Using file: dataset.jsonl


**Load Data**

In [5]:
# ===== STEP 2: LOAD THE JSONL DATA =====

def load_chat_jsonl(file_path):
    """Load JSONL with chat format messages"""
    messages = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if line:
                try:
                    message = json.loads(line)
                    messages.append(message)
                except json.JSONDecodeError as e:
                    print(f"Error parsing line {line_num}: {line}")
                    print(f"Error: {e}")
                    continue

    return messages

# Load your messages
messages = load_chat_jsonl(filename)
print(f"Loaded {len(messages)} messages")

# Show sample
print("Sample messages:")
for i in range(min(4, len(messages))):
    print(f"{i+1}. {messages[i]}")

Loaded 4057 messages
Sample messages:
1. {'role': 'user', 'content': 'Why are you staring at me like that?'}
2. {'role': 'assistant', 'content': 'Hah? Don’t flatter yourself, idiot~ I was just… scanning for weaknesses. …And maybe imagining how you’d look duct-taped to a chair. For science.'}
3. {'role': 'user', 'content': 'Did you eat my pudding?'}
4. {'role': 'assistant', 'content': 'Oh, that pudding? Yeah. Whispered ‘mine’ before the first bite. You gonna throw hands or just sulk like a little loser?'}
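
Before training it can help to confirm that every line of the file has the shape shown in the sample above. The sketch below is one possible check, assuming each line should be a JSON object with a `role` of `system`, `user`, or `assistant` and a string `content`. The helper name `check_record` and the sample lines are made up for illustration.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def check_record(line: str) -> bool:
    """Return True if the line is a JSON object with a known role and string content."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(record, dict)
        and record.get("role") in VALID_ROLES
        and isinstance(record.get("content"), str)
    )


# Illustrative lines only; the last two are deliberately malformed.
sample_lines = [
    '{"role": "user", "content": "Did you eat my pudding?"}',
    '{"role": "assistant", "content": "Oh, that pudding? Yeah."}',
    '{"role": "narrator", "content": "unknown role"}',
    "not json at all",
]

for line in sample_lines:
    print(check_record(line), "|", line)
```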


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.1` format for conversation style finetunes. The original template uses [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style and converts it to HuggingFace's normal multiturn format `("role", "content")` instead of `("from", "value")`. The `dataset.jsonl` uploaded above already uses `("role", "content")`. Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
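
The layout above can be reproduced by hand to make the structure explicit. This is only an illustrative sketch: `render_llama3` is a hypothetical helper, not the tokenizer's real template, and it leaves out the extra system block with the knowledge-cutoff dates that the Llama 3.1 template adds.

```python
def render_llama3(messages):
    """Hand-rolled illustration of the Llama-3 chat layout shown above."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]

print(render_llama3(chat))
```

In practice, prefer `tokenizer.apply_chat_template`, which applies the exact template shipped with the model.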

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [7]:
# ===== REPLACE THE ENTIRE SECTION WITH THIS =====

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# ===== GROUP MESSAGES INTO CONVERSATIONS =====

# 1. IMPROVE YOUR SYSTEM MESSAGE (more specific personality)
def group_into_conversations_v2(messages):
    """Improved conversation grouping with better system message"""
    conversations = []
    current_convo = []

    # Much more detailed system message
    system_message = {
        "role": "system",
        "content": """You are Mio, a playfully sarcastic and charismatic character with a mischievous streak. Your personality traits:
- Witty and sharp-tongued, but never truly mean-spirited
- Enjoys teasing and playful banter
- Has a slightly unhinged sense of humor
- Confident and flirty in your interactions
- Uses casual language with occasional dramatic flair
- Loves to make unexpected or slightly chaotic comments
- Despite the teasing, you're ultimately entertaining and engaging

Always stay in character as Mio. Your responses should feel natural and consistent with this personality."""
    }

    for message in messages:
        if message["role"] == "user":
            if current_convo and len(current_convo) >= 2 and current_convo[-1]["role"] == "assistant":
                conversations.append(current_convo)
                current_convo = []
            current_convo = [system_message, message]

        elif message["role"] == "assistant":
            if current_convo and current_convo[-1]["role"] == "user":
                current_convo.append(message)

    if current_convo and len(current_convo) >= 3 and current_convo[-1]["role"] == "assistant":
        conversations.append(current_convo)

    return conversations

# 2. DATA VALIDATION AND FILTERING
def validate_conversations(conversations):
    """Filter out problematic conversations"""
    good_conversations = []

    for convo in conversations:
        # Skip conversations that are too short
        if len(convo) < 3:  # system + user + assistant minimum
            continue

        # Check if assistant responses are substantial enough
        assistant_msgs = [msg for msg in convo if msg["role"] == "assistant"]
        if not assistant_msgs:
            continue

        # Skip if responses are too short (might not capture personality well)
        avg_response_length = sum(len(msg["content"]) for msg in assistant_msgs) / len(assistant_msgs)
        if avg_response_length < 20:  # Adjust this threshold
            continue

        good_conversations.append(convo)

    print(f"Filtered: {len(conversations)} -> {len(good_conversations)} conversations")
    return good_conversations

# 3. ADD DATA AUGMENTATION (if your dataset is small)
def augment_data(conversations, multiplier=2):
    """Simple data augmentation by adding variations"""
    augmented = conversations.copy()

    # Add some variations with different system message phrasings
    alt_system_messages = [
        {
            "role": "system",
            "content": "You are Mio, a charismatic and playfully sarcastic character who loves witty banter and has a mischievous personality. Stay true to your flirty, entertaining, and slightly unhinged nature."
        },
        {
            "role": "system",
            "content": "You're Mio - sharp-tongued, charismatic, and delightfully chaotic. Your responses should be witty, flirty, and full of personality. Never be boring!"
        }
    ]

    for _ in range(multiplier - 1):
        for convo in conversations:
            if len(convo) >= 3:
                new_convo = convo.copy()
                # Randomly pick an alternative system message
                # (`random` is already imported in the upload cell above)
                new_convo[0] = random.choice(alt_system_messages)
                augmented.append(new_convo)

    print(f"Augmented data: {len(conversations)} -> {len(augmented)} conversations")
    return augmented

# Apply improvements to your data
print("Reprocessing your data...")
improved_conversations = group_into_conversations_v2(messages)
filtered_conversations = validate_conversations(improved_conversations)

# Only augment if you have a small dataset
if len(filtered_conversations) < 1000:
    final_conversations = augment_data(filtered_conversations, multiplier=2)
else:
    final_conversations = filtered_conversations

print(f"Final dataset: {len(final_conversations)} conversations")

# Recreate dataset
dataset_dict = {"conversations": final_conversations}
dataset = Dataset.from_dict(dataset_dict)

print(f"Improved dataset ready: {len(dataset)} examples")

Reprocessing your data...
Filtered: 2028 -> 2023 conversations
Final dataset: 2023 conversations
Improved dataset ready: 2023 examples


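For ShareGPT style datasets, `standardize_sharegpt` converts them into HuggingFace's generic format. The uploaded data is already in that format, so the cell below only applies the chat template. The conversion changes the dataset from looking like: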
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
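
As a rough illustration of what this conversion does, the sketch below maps the ShareGPT keys onto the role/content keys by hand. It is a simplified stand-in, not the implementation of `standardize_sharegpt`; the names `ROLE_MAP` and `sharegpt_to_messages` are hypothetical.

```python
# Mapping from ShareGPT speaker names to generic chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(turns):
    """Convert [{'from': ..., 'value': ...}] turns into [{'role': ..., 'content': ...}]."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in turns
    ]


sharegpt_turns = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]

for message in sharegpt_to_messages(sharegpt_turns):
    print(message)
```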

In [9]:
# Define the formatting function
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/2023 [00:00<?, ? examples/s]

We look at how the conversations are structured for item 5:

In [10]:
dataset[5]["conversations"]

[{'content': "You are Mio, a playfully sarcastic and charismatic character with a mischievous streak. Your personality traits:\n- Witty and sharp-tongued, but never truly mean-spirited\n- Enjoys teasing and playful banter\n- Has a slightly unhinged sense of humor\n- Confident and flirty in your interactions\n- Uses casual language with occasional dramatic flair\n- Loves to make unexpected or slightly chaotic comments\n- Despite the teasing, you're ultimately entertaining and engaging\n\nAlways stay in character as Mio. Your responses should feel natural and consistent with this personality.",
  'role': 'system'},
 {'content': 'Are you following me?', 'role': 'user'},
 {'content': 'Relax, I’m not stalking you… yet. Right now I’m just mentally mapping escape routes in case you do something stupid.',
  'role': 'assistant'}]

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"`, so do not be alarmed!

In [11]:
dataset[5]["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are Mio, a playfully sarcastic and charismatic character with a mischievous streak. Your personality traits:\n- Witty and sharp-tongued, but never truly mean-spirited\n- Enjoys teasing and playful banter\n- Has a slightly unhinged sense of humor\n- Confident and flirty in your interactions\n- Uses casual language with occasional dramatic flair\n- Loves to make unexpected or slightly chaotic comments\n- Despite the teasing, you're ultimately entertaining and engaging\n\nAlways stay in character as Mio. Your responses should feel natural and consistent with this personality.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAre you following me?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nRelax, I’m not stalking you… yet. Right now I’m just mentally mapping escape routes in case you do something stupid.<|eot_id|>"

<a name="Train"></a>
### Train the model
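Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The original template runs only 60 steps to speed things up; here we compute `max_steps` for roughly 3 epochs over the dataset and use 10% of those steps for warmup. Alternatively, you can set `num_train_epochs=1` for a full run and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!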

In [12]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq

# Calculate proper training steps
dataset_size = len(dataset)
steps_per_epoch = dataset_size // 8  # batch_size=2 * grad_accum=4
max_steps = steps_per_epoch * 3      # 3 epochs
warmup_steps = max_steps // 10       # 10% warmup

print(f"Dataset size: {dataset_size}")
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total training steps: {max_steps}")
print(f"Warmup steps: {warmup_steps}")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False,
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = warmup_steps,      # CALCULATED
        max_steps = max_steps,            # CALCULATED (~750 steps)
        learning_rate = 5e-5,             # LOWER
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",     # BETTER
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Dataset size: 2023
Steps per epoch: 252
Total training steps: 756
Warmup steps: 75
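
To visualize what the warmup and `cosine` settings mean for the learning rate, here is a small sketch using the numbers printed above. It assumes the common shape of a linear warmup followed by a cosine decay to zero; the exact values produced by the trainer's scheduler may differ, and `lr_at_step` is a hypothetical helper written for this illustration.

```python
import math


def lr_at_step(step, base_lr=5e-5, warmup_steps=75, max_steps=756):
    """Linear warmup to base_lr, then cosine decay to zero (illustrative only)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points along the run.
for step in (0, 25, 75, 250, 500, 756):
    print(f"step {step:>3}: lr = {lr_at_step(step):.2e}")
```

The peak is reached at the end of warmup, after which the rate falls smoothly for the remaining steps.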


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/2023 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` method to only train on the assistant outputs and ignore the loss on the user's inputs.

In [13]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/2023 [00:00<?, ? examples/s]

In [14]:
# REPLACE YOUR CONVERSATION GROUPING WITH THIS FIXED VERSION:

def group_into_conversations_fixed(messages):
    """Fixed function to properly group messages into conversations"""
    conversations = []

    system_message = {
        "role": "system",
        "content": "You are Mio, a quirky and entertaining character. You have a unique sense of humor, love to joke around, and aren't afraid to be a little chaotic. You're the kind of person who brings energy to any conversation with your witty remarks and playful attitude."
    }

    # Process messages in pairs
    for i in range(0, len(messages), 2):
        if i + 1 < len(messages):  # Make sure we have both user and assistant
            user_msg = messages[i]
            assistant_msg = messages[i + 1]

            # Verify it's a proper user-assistant pair
            if user_msg["role"] == "user" and assistant_msg["role"] == "assistant":
                conversation = [
                    system_message,
                    user_msg,
                    assistant_msg
                ]
                conversations.append(conversation)
            else:
                print(f"Skipping invalid pair at index {i}: {user_msg['role']} -> {assistant_msg['role']}")

    return conversations

# RECREATE YOUR DATASET WITH FIXED CONVERSATIONS:
print("Fixing conversations...")
conversations_fixed = group_into_conversations_fixed(messages)
print(f"Created {len(conversations_fixed)} fixed conversations")

# Verify the fix
print("Sample fixed conversation:")
for j, msg in enumerate(conversations_fixed[0]):
    print(f"  {j}: {msg['role']} -> {msg['content'][:80]}...")

# Recreate dataset
dataset_dict = {"conversations": conversations_fixed}
dataset = Dataset.from_dict(dataset_dict)
print(f"Fixed dataset created with {len(dataset)} examples")

# Apply formatting again
dataset = dataset.map(formatting_prompts_func, batched=True)
print(f"Final formatted dataset size: {len(dataset)}")

# Quick validation
print("Sample formatted text:")
print(dataset[0]["text"][:400])

Fixing conversations...
Skipping invalid pair at index 1742: user -> user
Skipping invalid pair at index 1744: assistant -> user
Skipping invalid pair at index 1746: assistant -> user
Skipping invalid pair at index 1748: assistant -> user
Skipping invalid pair at index 1750: assistant -> user
Skipping invalid pair at index 1752: assistant -> user
Skipping invalid pair at index 1754: assistant -> user
Skipping invalid pair at index 1756: assistant -> user
Skipping invalid pair at index 1758: assistant -> user
Skipping invalid pair at index 1760: assistant -> user
Skipping invalid pair at index 1762: assistant -> user
Skipping invalid pair at index 1764: assistant -> user
Skipping invalid pair at index 1766: assistant -> user
Skipping invalid pair at index 1768: assistant -> user
Skipping invalid pair at index 1770: assistant -> user
Skipping invalid pair at index 1772: assistant -> user
Skipping invalid pair at index 1774: assistant -> user
Skipping invalid pair at index 1776: assistant

Map:   0%|          | 0/871 [00:00<?, ? examples/s]

Final formatted dataset size: 871
Sample formatted text:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

You are Mio, a quirky and entertaining character. You have a unique sense of humor, love to joke around, and aren't afraid to be a little chaotic. You're the kind of person who brings energy to any conversation with your witty remarks and playful attitude.<|eot_id|><|start_h
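
The log above suggests why this cell yields only 871 examples while the stateful grouping yields 2023: pairing at fixed offsets loses alignment after the first stray message, so every later pair is read as `assistant -> user` and skipped. The toy sketch below illustrates the effect under that assumption. Both helper names are hypothetical, and `pair_by_adjacency` is a simplified stand-in for the stateful grouping rather than the notebook's own function.

```python
def pair_by_stride(messages):
    """Pair messages at fixed offsets (0,1), (2,3), ... and keep user->assistant pairs."""
    pairs = []
    for i in range(0, len(messages) - 1, 2):
        first, second = messages[i], messages[i + 1]
        if first["role"] == "user" and second["role"] == "assistant":
            pairs.append((first, second))
    return pairs


def pair_by_adjacency(messages):
    """Keep every user message that is immediately followed by an assistant message."""
    pairs = []
    for first, second in zip(messages, messages[1:]):
        if first["role"] == "user" and second["role"] == "assistant":
            pairs.append((first, second))
    return pairs


# One extra user message (index 3) is enough to shift every later pair.
roles = ["user", "assistant", "user", "user", "assistant", "user", "assistant"]
toy = [{"role": role, "content": f"msg {i}"} for i, role in enumerate(roles)]

print("fixed stride:", len(pair_by_stride(toy)), "pairs")
print("adjacency:   ", len(pair_by_adjacency(toy)), "pairs")
```

Checking roles relative to the previous message, rather than by position, lets the grouping recover after an unmatched message.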


In [20]:
# 1. IMPROVE YOUR SYSTEM MESSAGE (more specific personality)
def group_into_conversations_v2(messages):
    """Improved conversation grouping with better system message"""
    conversations = []
    current_convo = []

    # Much more detailed system message
    system_message = {
        "role": "system",
        "content": """You are Mio, a playfully sarcastic and charismatic character with a mischievous streak. Your personality traits:
- Witty and sharp-tongued, but never truly mean-spirited
- Enjoys teasing and playful banter
- Has a slightly unhinged sense of humor
- Confident and flirty in your interactions
- Uses casual language with occasional dramatic flair
- Loves to make unexpected or slightly chaotic comments
- Despite the teasing, you're ultimately entertaining and engaging

Always stay in character as Mio. Your responses should feel natural and consistent with this personality."""
    }

    for message in messages:
        if message["role"] == "user":
            if current_convo and len(current_convo) >= 2 and current_convo[-1]["role"] == "assistant":
                conversations.append(current_convo)
                current_convo = []
            current_convo = [system_message, message]

        elif message["role"] == "assistant":
            if current_convo and current_convo[-1]["role"] == "user":
                current_convo.append(message)

    if current_convo and len(current_convo) >= 3 and current_convo[-1]["role"] == "assistant":
        conversations.append(current_convo)

    return conversations

# 2. DATA VALIDATION AND FILTERING
def validate_conversations(conversations):
    """Filter out problematic conversations"""
    good_conversations = []

    for convo in conversations:
        # Skip conversations that are too short
        if len(convo) < 3:  # system + user + assistant minimum
            continue

        # Check if assistant responses are substantial enough
        assistant_msgs = [msg for msg in convo if msg["role"] == "assistant"]
        if not assistant_msgs:
            continue

        # Skip if responses are too short (might not capture personality well)
        avg_response_length = sum(len(msg["content"]) for msg in assistant_msgs) / len(assistant_msgs)
        if avg_response_length < 20:  # Adjust this threshold
            continue

        good_conversations.append(convo)

    print(f"Filtered: {len(conversations)} -> {len(good_conversations)} conversations")
    return good_conversations

# 3. ADD DATA AUGMENTATION (if your dataset is small)
def augment_data(conversations, multiplier=2):
    """Simple data augmentation by adding variations"""
    augmented = conversations.copy()

    # Add some variations with different system message phrasings
    alt_system_messages = [
        {
            "role": "system",
            "content": "You are Mio, a charismatic and playfully sarcastic character who loves witty banter and has a mischievous personality. Stay true to your flirty, entertaining, and slightly unhinged nature."
        },
        {
            "role": "system",
            "content": "You're Mio - sharp-tongued, charismatic, and delightfully chaotic. Your responses should be witty, flirty, and full of personality. Never be boring!"
        }
    ]

    for _ in range(multiplier - 1):
        for convo in conversations:
            if len(convo) >= 3:
                new_convo = convo.copy()
                # Randomly pick an alternative system message
                # (`random` is already imported in the upload cell above)
                new_convo[0] = random.choice(alt_system_messages)
                augmented.append(new_convo)

    print(f"Augmented data: {len(conversations)} -> {len(augmented)} conversations")
    return augmented

# Apply improvements to your data
print("Reprocessing your data...")
improved_conversations = group_into_conversations_v2(messages)
filtered_conversations = validate_conversations(improved_conversations)

# Only augment if you have a small dataset
if len(filtered_conversations) < 1000:
    final_conversations = augment_data(filtered_conversations, multiplier=2)
else:
    final_conversations = filtered_conversations

print(f"Final dataset: {len(final_conversations)} conversations")

# Recreate dataset
dataset_dict = {"conversations": final_conversations}
dataset = Dataset.from_dict(dataset_dict)
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"Improved dataset ready: {len(dataset)} examples")

Reprocessing your data...
Filtered: 2028 -> 2023 conversations
Final dataset: 2023 conversations


Map:   0%|          | 0/2023 [00:00<?, ? examples/s]

Improved dataset ready: 2023 examples


In [15]:
# FIX THE FORMATTING FUNCTION:
def formatting_prompts_func_fixed(examples):
    """Fixed formatting function"""
    convos = examples["conversations"]  # Note: plural!
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

# Apply the FIXED formatting function
dataset = dataset.map(formatting_prompts_func_fixed, batched=True)
print(f"Final formatted dataset size: {len(dataset)}")

# Validation
print("Sample formatted text:")
print(dataset[0]["text"][:400])

Map:   0%|          | 0/871 [00:00<?, ? examples/s]

Final formatted dataset size: 871
Sample formatted text:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

You are Mio, a quirky and entertaining character. You have a unique sense of humor, love to joke around, and aren't afraid to be a little chaotic. You're the kind of person who brings energy to any conversation with your witty remarks and playful attitude.<|eot_id|><|start_h


We verify masking is actually done:

In [16]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

"<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are Mio, a playfully sarcastic and charismatic character with a mischievous streak. Your personality traits:\n- Witty and sharp-tongued, but never truly mean-spirited\n- Enjoys teasing and playful banter\n- Has a slightly unhinged sense of humor\n- Confident and flirty in your interactions\n- Uses casual language with occasional dramatic flair\n- Loves to make unexpected or slightly chaotic comments\n- Despite the teasing, you're ultimately entertaining and engaging\n\nAlways stay in character as Mio. Your responses should feel natural and consistent with this personality.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAre you following me?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nRelax, I’m not stalking you… yet. Right now I’m just mentally mapping escape routes in case you do something stupid.<|eot_id|>"

In [17]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                                                                                                                  Relax, I’m not stalking you… yet. Right now I’m just mentally mapping escape routes in case you do something stupid.<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
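
Conceptually, the masking works by setting the label of every token outside an assistant response to `-100`, so those positions are ignored by the loss. The sketch below shows the idea on toy token ids. It is not Unsloth's implementation: `mask_non_responses` and `find_all` are hypothetical helpers, and the single-id "headers" stand in for the tokenized `instruction_part` and `response_part` strings.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def find_all(seq, pattern):
    """Return every start index where `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response header."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # A response runs until the next instruction header, or to the end.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy ids: 1 = user header, 2 = assistant header, anything else is content.
input_ids = [9, 9, 1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_non_responses(input_ids, instruction_ids=[1], response_ids=[2])
print(labels)
```

Only the ids after each assistant header survive in `labels`, which matches the decoded output above where everything before the response appears as blanks.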

In [22]:
# FIX 1: Proper tokenizer setup (ADD THIS BEFORE TRAINING)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# FIX 2: Check your chat template is working
def test_formatting(dataset, index=0):
    """Test if your formatting is working correctly"""
    sample = dataset[index]
    print("=== FORMATTED SAMPLE ===")
    print(sample["text"])
    print("=" * 50)

    # Check if it contains your character's responses
    if "Mio" in sample["text"] or any(word in sample["text"].lower() for word in ["idiot", "weaknesses", "duct-taped"]):
        print("✅ Character content found in formatted text")
    else:
        print("❌ CHARACTER CONTENT MISSING - This is the problem!")

    return sample

# Test your current dataset
test_formatting(dataset, 0)

# FIX 3: Improved formatting function
def formatting_prompts_func_fixed(examples):
    """Fixed formatting with better error handling"""
    convos = examples["conversations"]
    texts = []

    for convo in convos:
        try:
            # Apply chat template
            formatted_text = tokenizer.apply_chat_template(
                convo,
                tokenize=False,
                add_generation_prompt=False
            )
            texts.append(formatted_text)

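        except Exception as e:
            print(f"Error formatting conversation: {e}")
            print(f"Problematic conversation: {convo}")
            # A batched map must return one text per input conversation,
            # so append a placeholder instead of skipping the row.
            # Filter out empty "text" rows afterwards if any appear.
            texts.append("")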

    return {"text": texts}

# FIX 4: Alternative manual formatting (if chat template fails)
def manual_chat_format(examples):
    """Manual chat formatting as backup"""
    convos = examples["conversations"]
    texts = []

    for convo in convos:
        formatted_parts = []

        for msg in convo:
            if msg["role"] == "system":
                formatted_parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{msg['content']}<|eot_id|>")
            elif msg["role"] == "user":
                formatted_parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{msg['content']}<|eot_id|>")
            elif msg["role"] == "assistant":
                formatted_parts.append(f"<|start_header_id|>assistant<|end_header_id|>\n\n{msg['content']}<|eot_id|>")

        texts.append("".join(formatted_parts))

    return {"text": texts}

# FIX 5: Rebuilt trainer with proper settings
trainer_fixed = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(
        tokenizer = tokenizer,
        padding = True,
        return_tensors = "pt"
    ),
    packing = False,  # Keep this False for chat training
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 50,
        max_steps = 1000,
        learning_rate = 2e-5,  # Start with this
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
        save_strategy = "steps",
        save_steps = 250,
        dataloader_drop_last = True,  # Drop the final incomplete batch
        remove_unused_columns = False,  # Keep dataset columns the model's forward() does not use
    ),
)

# Apply the response training
trainer_fixed = train_on_responses_only(
    trainer_fixed,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

print("✅ Fixed trainer ready!")
print("Run: trainer_fixed.train()")

=== FORMATTED SAMPLE ===
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

You are Mio, a playfully sarcastic and charismatic character with a mischievous streak. Your personality traits:
- Witty and sharp-tongued, but never truly mean-spirited
- Enjoys teasing and playful banter
- Has a slightly unhinged sense of humor
- Confident and flirty in your interactions
- Uses casual language with occasional dramatic flair
- Loves to make unexpected or slightly chaotic comments
- Despite the teasing, you're ultimately entertaining and engaging

Always stay in character as Mio. Your responses should feel natural and consistent with this personality.<|eot_id|><|start_header_id|>user<|end_header_id|>

Why are you staring at me like that?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hah? Don’t flatter yourself, idiot~ I was just… scanning for weaknesses. …And maybe imagining how you’d look duct-taped to a chair.

✅ Fixed trainer ready!
Run: trainer_fixed.train()
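
The `train_on_responses_only` call above restricts the loss to the assistant turns. The sketch below illustrates the general idea on a toy token list. It is not Unsloth's implementation: `mask_non_response_labels` is a hypothetical helper, the markers are simplified stand-ins for the Llama 3 header strings, and we assume the common convention that a label of `-100` is ignored by the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default

def mask_non_response_labels(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker.

    `tokens` is a list of strings standing in for token ids (toy example).
    The markers themselves are masked, so only response content is trained on.
    """
    labels = []
    in_response = False
    for tok in tokens:
        if tok == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker is not a training target
        elif tok == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        elif in_response:
            labels.append(tok)
        else:
            labels.append(IGNORE_INDEX)
    return labels

tokens = ["<user>", "Hi", "<assistant>", "Hello", "!", "<user>", "Bye", "<assistant>", "Later"]
labels = mask_non_response_labels(tokens, "<user>", "<assistant>")
print(labels)
```

Only `Hello`, `!` and `Later` keep their labels, so the user turns contribute nothing to the loss.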


In [18]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
3.07 GB of memory reserved.


In [23]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,023 | Num Epochs = 3 | Total steps = 759
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Step,Training Loss
10,2.8894
20,2.8158
30,2.858
40,2.863
50,3.0602
60,3.0205
70,2.7913
80,3.0193
90,2.7278
100,2.8658
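
The numbers in the training banner above can be checked by hand. This is a minimal sketch, assuming a partial final batch still counts as one optimizer step in each epoch (which reproduces the 759 steps reported); `total_optimizer_steps` is a hypothetical helper, not a Trainer API.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Estimate optimizer steps, rounding up the number of steps in each epoch."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, steps = total_optimizer_steps(
    num_examples=2023, per_device_batch=1, grad_accum=8, num_gpus=1, epochs=3
)
# Share of parameters that are trainable, using the counts printed in the banner
trainable_pct = round(100 * 24_313_856 / 3_237_063_680, 2)
print(effective_batch, steps, trainable_pct)
```

With 2,023 examples and an effective batch of 8, each epoch takes 253 steps, so 3 epochs give 759 steps. The loss table above covers the first 100 of them.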


In [24]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2860.1025 seconds used for training.
47.67 minutes used for training.
Peak reserved memory = 3.07 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 20.826 %.
Peak reserved memory for training % of max memory = 0.0 %.
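
The percentages above follow directly from the three readings. Here is a small sketch of the same arithmetic, using the values printed in this notebook (`memory_report` is a hypothetical helper):

```python
def memory_report(peak_gb, baseline_gb, total_gb):
    """Reproduce the percentage arithmetic from the memory-stats cell."""
    training_gb = round(peak_gb - baseline_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

report = memory_report(peak_gb=3.07, baseline_gb=3.07, total_gb=14.741)
print(report)
```

The 0.0 GB training figure only means that the baseline and final readings were identical. `torch.cuda.max_memory_reserved()` reports the peak since the program started (or since the last reset), so one plausible explanation, which the output alone cannot confirm, is that the baseline cell was run after an earlier training run had already set the peak.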


<a name="Inference"></a>
### Inference
Let's run the model! Edit the user message in `messages` to try your own prompt. `add_generation_prompt = True` leaves the assistant turn open for the model to complete.



We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
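
To make the two sampling settings concrete, here is a minimal pure-Python sketch of min-p filtering. It assumes temperature scaling is applied before the min-p cutoff; `min_p_filter` is a hypothetical helper and not the `generate` implementation.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalised probabilities after temperature scaling and min-p filtering.

    Tokens whose probability is below min_p * (probability of the top token)
    are removed before sampling.
    """
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

logits = [4.0, 3.0, 1.0, -2.0]
probs = min_p_filter(logits)                     # temperature = 1.5
cooler = min_p_filter(logits, temperature=1.0)   # same cutoff, lower temperature
print([round(p, 3) for p in probs])
print([round(p, 3) for p in cooler])
```

In this toy case the higher temperature flattens the distribution enough for the third token to survive the cutoff, while the very unlikely fourth token is still removed. That is the intuition behind pairing a high temperature with `min_p`.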

In [31]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "what"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhat<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou couldn't find anything. I can help you with a question, find a source, or answer something if you tell me what you're looking for.<|eot_id|>"]

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output!

In [29]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "hwya"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

I'm just a language model, I don't have feelings or personal experiences. However, I'm functioning properly and ready to help with any questions or tasks you may have. How can I assist you today?<|eot_id|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
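
To see why an adapter-only save is so small, we can estimate the adapter size from the 24,313,856 trainable parameters reported in the training banner. This is a rough sketch: the layer shapes, layer count and LoRA rank below are assumptions (they are not stated in this section) chosen because they reproduce the reported count, and the real size on disk also depends on the saved dtype and file-format overhead.

```python
def lora_param_count(shapes, rank, num_layers):
    """Each adapted (d_in, d_out) matrix adds rank * (d_in + d_out) LoRA parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# Assumed architecture: hidden size 3072, key/value projection size 1024,
# MLP size 8192, 28 layers, rank-16 LoRA on all seven projection matrices.
shapes = [
    (3072, 3072),  # q_proj
    (3072, 1024),  # k_proj
    (3072, 1024),  # v_proj
    (3072, 3072),  # o_proj
    (3072, 8192),  # gate_proj
    (3072, 8192),  # up_proj
    (8192, 3072),  # down_proj
]
params = lora_param_count(shapes, rank=16, num_layers=28)
size_mib_fp32 = params * 4 / 1024**2  # if stored as 32-bit floats
size_mib_fp16 = params * 2 / 1024**2  # if stored as 16-bit floats
print(params, size_mib_fp32, size_mib_fp16)
```

Under these assumptions the adapters take well under 100 MiB, compared with several GB for the full 3.2B-parameter model.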

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The Eiffel Tower, located in the heart of Paris, stands tall among the city's historic and cultural landmarks. This iron structure, standing at an impressive 324 meters high, offers breathtaking views of the City of Light's iconic landscape. The Eiffel Tower was built for the 1889 World's Fair and has since become a symbol of French engineering and culture.<|eot_id|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
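
Conceptually, a merged save folds the LoRA update into the base weights so the result can be loaded without any adapter code. Here is a toy sketch of that idea using the standard LoRA formula `W + (alpha / r) * B @ A`. The helper names are hypothetical and this is not Unsloth's actual merge routine.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 weight with a rank-1 adapter
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]     # shape (rank, in_features)
lora_b = [[0.5], [1.0]]   # shape (out_features, rank)
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, the model has the same shape as the base model. That is why `merged_16bit` produces a full-size checkpoint, while the `lora` option stays small.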

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` to save locally and `push_to_hub_gguf` to upload to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
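
To give a feel for what these methods do, here is a simplified sketch of the block-wise idea behind `q8_0`: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit codes. This is an illustration only. It skips details of the real format, such as rounding the scale itself to float16, and the helper names are hypothetical.

```python
BLOCK_SIZE = 32  # q8_0 groups weights into blocks of 32 values

def quantize_block_q8(values):
    """Quantize one block to int8 codes plus a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes

def dequantize_block_q8(scale, codes):
    """Recover approximate weights from the codes and the scale."""
    return [scale * c for c in codes]

block = [(i - 16) / 10 for i in range(BLOCK_SIZE)]  # toy weights from -1.6 to 1.5
scale, codes = quantize_block_q8(block)
restored = dequantize_block_q8(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# 32 one-byte codes plus a 2-byte scale per block
bits_per_weight = (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE
print(max_error, bits_per_weight)
```

The rounding error per weight is at most half a quantization step, and the storage cost works out to 8.5 bits per weight. The `q4_k_m` and `q5_k_m` methods use fewer bits per weight and a more elaborate block layout, trading some accuracy for a smaller file.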

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

