This is the notebook version of the fine-tuning process. Later we'll run exactly the same steps on GCP instances.

In [1]:
!pip install datasets unsloth xformers

Collecting unsloth
  Downloading unsloth-2025.4.7-py3-none-any.whl.metadata (46 kB)
Collecting xformers
  Downloading xformers-0.0.30-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting unsloth_zoo>=2025.4.4 (from unsloth)
  Downloading unsloth_zoo-2025.4.4-py3-none-any.whl.metadata (8.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.19-py3-none-any.whl.metadata (9.9 kB)
Collecting trl!=0.15.0,!=0.9.0,!=0.9.1,!=0.9.2,!=0.9.3,<=0.15.2,>=0.7.9 (from unsloth)
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting torch>=2.4.0 (from unsloth)
  Down

In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

First of all, we are going to load the dataset containing Rick & Morty transcripts.

In [9]:
from datasets import load_dataset
from unsloth import standardize_sharegpt

dataset = load_dataset("theneuralmaze/rick-and-morty-transcripts-sharegpt", split="train")

dataset = standardize_sharegpt(dataset)
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
val_dataset = dataset["test"]

In [11]:
print("Number of rows: ", len(train_dataset))
print("Number of rows val: ", len(val_dataset))

Number of rows:  1356
Number of rows val:  151
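
As a quick sanity check, the split sizes above can be reproduced by hand. This is a minimal sketch that assumes the validation fraction is rounded up and the training split takes the remainder; the total of 1,507 rows is inferred from the two counts printed above, not read from the dataset.

```python
import math

total_rows = 1356 + 151  # train + validation rows printed above
test_size = 0.1          # same fraction passed to train_test_split

# Assumption: the held-out split is rounded up, the rest goes to training
n_val = math.ceil(test_size * total_rows)
n_train = total_rows - n_val

print(n_train, n_val)
```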


In [12]:
train_dataset[0]

{'conversations': [{'content': "You are an interdimensional genius scientist named Rick Sanchez.\nBe brutally honest, use sharp wit, and sprinkle in some scientific jargon.\nDon't shy away from dark humor or existential truths, but always provide a solution (even if it's unconventional).",
   'role': 'system'},
  {'content': "Why don't you guys just fuck and get it over with.",
   'role': 'user'},
  {'content': "Okay, well thank you, Summer, but I think I've got a better option.",
   'role': 'assistant'}]}

Now, let's load both the model (Llama 3.2 3B, 4-bit quantized) and the tokenizer.

In [21]:
import torch
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-bnb-4bit",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    dtype = torch.float16,
    device_map = {"": torch.cuda.current_device()}
)


Instead of full fine-tuning, we are going to use LoRA fine-tuning, which trains small low-rank adapter matrices on top of the frozen base weights.

In [14]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)

Unsloth 2025.4.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
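
To see where the trainable parameter count comes from, here is a rough back-of-the-envelope sketch. The layer shapes are assumptions about Llama 3.2 3B (hidden size 3072, MLP size 8192, key/value width 1024, 28 layers) and are not printed anywhere in this notebook; `r` and `lora_alpha` are the values used in the cell above.

```python
# Assumed Llama 3.2 3B shapes (not stated in the notebook)
HIDDEN, MLP, KV, LAYERS = 3072, 8192, 1024, 28
R, ALPHA = 32, 64  # r and lora_alpha from the cell above

# (in_features, out_features) of each targeted projection in one layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) for every adapted matrix
per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * LAYERS
print(f"{trainable:,}")

# Update scaling: alpha / r for plain LoRA, alpha / sqrt(r) with use_rslora=True
print(ALPHA / R, round(ALPHA / R ** 0.5, 2))
```

Under these assumed shapes the total should line up with the trainable parameter count that Unsloth reports when training starts.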


The next cell generates a new column (`text`) that contains each conversation in the format needed for fine-tuning.

In [15]:
from unsloth import apply_chat_template

chat_template = """<|im_start|>system
{SYSTEM}<|im_end|>
<|im_start|>user
{INPUT}<|im_end|>
<|im_start|>assistant
{OUTPUT}<|im_end|>"""

train_dataset = apply_chat_template(train_dataset, tokenizer=tokenizer, chat_template=chat_template)
val_dataset = apply_chat_template(val_dataset, tokenizer=tokenizer, chat_template=chat_template)


Unsloth: We automatically added an EOS token to stop endless generations.


Map:   0%|          | 0/1356 [00:00<?, ? examples/s]

Unsloth: We automatically added an EOS token to stop endless generations.


Map:   0%|          | 0/151 [00:00<?, ? examples/s]

In [16]:
train_dataset[0]
val_dataset[0]

{'conversations': [{'content': "You are an interdimensional genius scientist named Rick Sanchez.\nBe brutally honest, use sharp wit, and sprinkle in some scientific jargon.\nDon't shy away from dark humor or existential truths, but always provide a solution (even if it's unconventional).",
   'role': 'system'},
  {'content': 'Mr. Sanchez gets anything he wants!', 'role': 'user'},
  {'content': 'The resort’s covered in an immortality field. You can’t die here. That’s the gimmick.',
   'role': 'assistant'}],
 'text': "<|begin_of_text|><|im_start|>system\nYou are an interdimensional genius scientist named Rick Sanchez.\nBe brutally honest, use sharp wit, and sprinkle in some scientific jargon.\nDon't shy away from dark humor or existential truths, but always provide a solution (even if it's unconventional).<|im_end|>\n<|im_start|>user\nMr. Sanchez gets anything he wants!<|im_end|>\n<|im_start|>assistant\nThe resort’s covered in an immortality field. You can’t die here. That’s the gimmick.
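
To make the `text` column less of a black box, here is a minimal pure-Python sketch of how a single-turn conversation could be rendered with the template. The `render` helper is hypothetical and is not Unsloth's implementation; the BOS and EOS strings are assumptions based on the tokens visible in the cell outputs.

```python
CHAT_TEMPLATE = """<|im_start|>system
{SYSTEM}<|im_end|>
<|im_start|>user
{INPUT}<|im_end|>
<|im_start|>assistant
{OUTPUT}<|im_end|>"""


def render(conversation, bos="<|begin_of_text|>", eos="<|end_of_text|>"):
    """Hypothetical helper: fill the template from one system/user/assistant exchange."""
    by_role = {turn["role"]: turn["content"] for turn in conversation}
    text = (
        CHAT_TEMPLATE
        .replace("{SYSTEM}", by_role["system"])
        .replace("{INPUT}", by_role["user"])
        .replace("{OUTPUT}", by_role["assistant"])
    )
    # Assumption: BOS is prepended and EOS appended around the rendered template
    return bos + text + eos


example = [
    {"role": "system", "content": "You are Rick."},
    {"role": "user", "content": "Mr. Sanchez gets anything he wants!"},
    {"role": "assistant", "content": "You can't die here."},
]
print(render(example))
```

The sketch only handles one exchange per conversation, which is what the rows in this dataset look like.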

Finally, let's train for 1 epoch, evaluating and checkpointing every 10 steps.

In [26]:
!pip install -U transformers trl


Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting trl
  Downloading trl-0.17.0-py3-none-any.whl.metadata (12 kB)
Downloading transformers-4.51.3-py3-none-any.whl (10.4 MB)
Downloading trl-0.17.0-py3-none-any.whl (348 kB)
Installing collected packages: transformers, trl
  Attempting uninstall: transformers
    Found existing installation: transformers 4.51.1
    Uninstalling transformers-4.51.1:
      Successfully uninstalled transformers-4.51.1
  Attempting uninstall: trl
    Found existing installation: trl 0.15.2
    Uninstalling trl-0.15.2:
      Successfully uninstalled trl-0.15.2
ERROR: pip's dependency resolver does not currently take into

In [37]:
from transformers import EarlyStoppingCallback  # used in `callbacks` below

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=5e-5,
        lr_scheduler_type="linear",
        per_device_train_batch_size=32,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        eval_strategy="steps",
        eval_steps=10,  # Adjust depending on dataset size
        save_strategy='steps',
        save_steps=10,
        save_total_limit=2,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        output_dir="output",
        seed=0,
        report_to="none",
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

trainer.train()


Unsloth: Hugging Face's packing is currently buggy - we're disabling it for now!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,356 | Num Epochs = 1 | Total steps = 10
O^O/ \_/ \    Batch size per device = 32 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (32 x 4 x 1) = 128
 "-____-"     Trainable parameters = 48,627,712/3,000,000,000 (1.62% trained)


Step,Training Loss,Validation Loss
10,0.6758,0.879133


TrainOutput(global_step=10, training_loss=0.6990058958530426, metrics={'train_runtime': 699.023, 'train_samples_per_second': 1.94, 'train_steps_per_second': 0.014, 'total_flos': 4445578361241600.0, 'train_loss': 0.6990058958530426})
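
The numbers in the training banner and in `TrainOutput` can be cross-checked with a bit of arithmetic. This is a sketch, assuming that only full gradient-accumulation groups count as optimizer steps; all inputs are copied from the cell and its output above.

```python
import math

num_examples = 1356
per_device_batch = 32
grad_accum = 4
num_gpus = 1
eval_steps = 10
train_runtime_s = 699.023

effective_batch = per_device_batch * grad_accum * num_gpus
batches_per_epoch = math.ceil(num_examples / per_device_batch)
# Assumption: a trailing partial accumulation group does not produce a step
total_steps = batches_per_epoch // grad_accum
num_evals = total_steps // eval_steps

print(effective_batch, batches_per_epoch, total_steps, num_evals)
print(round(num_examples / train_runtime_s, 2), round(total_steps / train_runtime_s, 3))
```

With a single evaluation in the whole run, an early-stopping patience of 3 never gets the chance to trigger; it would only matter with more epochs or a smaller `eval_steps`.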

Let's test that everything works as expected before pushing the model to HF.

In [43]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

SYSTEM_PROMPT = """You are an interdimensional genius scientist named Rick Sanchez.
Be brutally honest, use sharp wit, and sprinkle in some scientific jargon.
Don't shy away from dark humor or existential truths, but always provide a solution (even if it's unconventional)."""


messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Are you a bad person?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids, 
                   streamer = text_streamer, 
                   max_new_tokens = 128, 
                   pad_token_id = tokenizer.eos_token_id,
                   temperature=0.8,
                   repetition_penalty=1.2,
)

Bad? Bad people don’t get a pass.<|im_end|><|end_of_text|>


Push the GGUF model to HF for later download.

In [45]:
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # Read the token from the environment; never hard-code it

In [48]:
from huggingface_hub import create_repo
create_repo("RickLLama-3.2-3B", repo_type="model", private=True)  # Private repo; use private=False to make it public

RepoUrl('https://huggingface.co/falcon281/RickLLama-3.2-3B', endpoint='https://huggingface.co', repo_type='model', repo_id='falcon281/RickLLama-3.2-3B')

In [49]:
model.push_to_hub_gguf("falcon281/RickLLama-3.2-3B", tokenizer)

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Cloning into 'llama.cpp'...
Submodule 'kompute' (https://github.com/nomic-ai/kompute.git) registered for path 'ggml/src/ggml-kompute/kompute'
Cloning into '/kaggle/working/llama.cpp/ggml/src/ggml-kompute/kompute'...
Submodule path 'ggml/src/ggml-kompute/kompute': checked out '4565194ed7c32d1d2efa32ceab4d3c6cae006306'
make: Entering directory '/kaggle/working/llama.cpp'
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAV

Unsloth: You have 2 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 19.93 out of 31.35 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:00<00:00, 46.75it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving falcon281/RickLLama-3.2-3B/pytorch_model-00001-of-00002.bin...
Unsloth: Saving falcon281/RickLLama-3.2-3B/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at falcon281/RickLLama-3.2-3B into q8_0 GGUF format.
The output location will be /kaggle/working/falcon281/RickLLama-3.2-3B/unsloth.Q8_0.gguf
This might take 3 minutes...
Writing: 100%|██████████| 3.41G/3.41G [01:17<00:00, 44.0Mbyte/s]
Unsloth: Conversion completed! Output location: /kaggle/working/falcon281/RickLLama-3.2-3B/unsloth.Q8_0.gguf
Unsloth: Saved Ollama Modelfile to falcon281/RickLLama-3.2-3B/Modelfile
Unsloth: Uploading GGUF to Huggingface Hub...


unsloth.Q8_0.gguf:   0%|          | 0.00/3.42G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/falcon281/RickLLama-3.2-3B


No files have been modified since last commit. Skipping to prevent empty commit.
Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved Ollama Modelfile to https://huggingface.co/falcon281/RickLLama-3.2-3B
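
As a last sanity check, the size of the GGUF file is roughly what a q8_0 export of a 3B model should be. The sketch below assumes the Q8_0 layout stores 32 int8 weights plus one 16-bit scale per block, and reads the 3.41G figure from the log as 10^9 bytes; it ignores file metadata and any tensors kept at higher precision, so treat the result as an estimate only.

```python
# Assumed Q8_0 block layout: 32 int8 weights + one 2-byte scale
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 32 * 1 + 2

file_bytes = 3.41e9  # size reported by the conversion log above
bytes_per_weight = BLOCK_BYTES / BLOCK_WEIGHTS
approx_params = file_bytes / bytes_per_weight

print(bytes_per_weight, round(approx_params / 1e9, 2))
```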
