In [None]:
!pip install colab-xterm #https://pypi.org/project/colab-xterm/
%load_ext colabxterm

In [None]:
%xterm
 # curl https://ollama.ai/install.sh | sh
 # ollama serve & 
 # ollama pull llama3.1

Initial query

In [1]:
import requests
import json

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Give me some Unsloth.ai News?"}],
    "stream": False
}
response = requests.post(url, data=json.dumps(payload))
print(response.json())

{'model': 'llama3.1', 'created_at': '2025-07-06T23:54:24.498315152Z', 'message': {'role': 'assistant', 'content': 'There is no such thing as "Unsloth.ai" news. It appears to be a made-up or fictional term.\n\nIf you\'re looking for information on AI, technology, or industry-related news, I\'d be happy to provide updates and insights from reputable sources like:\n\n* Google\'s AI research\n* Meta AI (formerly Facebook AI)\n* Microsoft AI\n* OpenAI\n* Research papers and publications in the field of artificial intelligence\n\nLet me know if there\'s something specific you\'re interested in, and I\'ll do my best to help!'}, 'done_reason': 'stop', 'done': True, 'total_duration': 6440074865, 'load_duration': 5076463525, 'prompt_eval_count': 18, 'prompt_eval_duration': 208438417, 'eval_count': 112, 'eval_duration': 1153440250}
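
The response above also carries timing fields. Ollama reports these durations in nanoseconds, so a small helper makes them easier to read. This is a minimal sketch: `summarize_timing` is a hypothetical helper name, and the values are copied from the response printed above.

```python
# Timing fields copied from the response printed above.
# Ollama reports durations in nanoseconds.
sample = {
    "total_duration": 6440074865,
    "load_duration": 5076463525,
    "prompt_eval_count": 18,
    "prompt_eval_duration": 208438417,
    "eval_count": 112,
    "eval_duration": 1153440250,
}

NS_PER_SECOND = 1_000_000_000


def summarize_timing(resp: dict) -> dict:
    """Convert nanosecond timings into seconds and generated tokens per second."""
    eval_seconds = resp["eval_duration"] / NS_PER_SECOND
    return {
        "total_s": resp["total_duration"] / NS_PER_SECOND,
        "load_s": resp["load_duration"] / NS_PER_SECOND,
        "tokens_per_s": resp["eval_count"] / eval_seconds,
    }


stats = summarize_timing(sample)
print({name: round(value, 2) for name, value in stats.items()})
```

Most of the total time in this first call is model loading, so later calls against the already loaded model should return faster.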


Now let's try the same query with the Ollama Python library instead of a raw HTTP request.

First we need to install the Ollama library for Python.

In [1]:
!pip install ollama

Looking in indexes: https://pypi.org/simple/
Collecting ollama
  Using cached ollama-0.5.1-py3-none-any.whl.metadata (4.3 kB)
Collecting pydantic>=2.9 (from ollama)
  Using cached pydantic-2.11.7-py3-none-any.whl.metadata (67 kB)
Collecting annotated-types>=0.6.0 (from pydantic>=2.9->ollama)
  Using cached annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)
Collecting pydantic-core==2.33.2 (from pydantic>=2.9->ollama)
  Using cached pydantic_core-2.33.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting typing-inspection>=0.4.0 (from pydantic>=2.9->ollama)
  Using cached typing_inspection-0.4.1-py3-none-any.whl.metadata (2.6 kB)
Using cached ollama-0.5.1-py3-none-any.whl (13 kB)
Using cached pydantic-2.11.7-py3-none-any.whl (444 kB)
Using cached pydantic_core-2.33.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Using cached annotated_types-0.7.0-py3-none-any.whl (13 kB)
Using cached typing_inspection-0.4.1-py3-none-any.whl 

Next, let's create a simple Python query using the Ollama library.

In [3]:
import ollama
response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Give me some Unsloth.ai News?"},
    ],
)
print(response["message"]["content"])

I couldn't find any information on "Unsloth.ai". It's possible that it's a fictional company, or it may be a very new and unknown entity. Can you provide more context or clarify what Unsloth.ai is? I'll do my best to help.

However, if you're interested in news related to AI, machine learning, or technology, I'd be happy to share some general updates or trends with you!


As you can see, llama3.1 knows nothing about Unsloth.ai: it either calls the name made-up or asks for more context, and the results can be quite humorous. We will fix this by fine-tuning the model on information about the Unsloth project.

The first step is to load the provided unsloth_documentation.pdf with PyPDFLoader from LangChain and split it into chunks with RecursiveCharacterTextSplitter. To use LangChain we need to install its dependencies and download the PDF.

In [2]:
!pip install langchain langchain_community pypdf
!wget https://raw.githubusercontent.com/Brian-McGinn/Fine-Tuning-Tutorial/main/unsloth_documentation.pdf

Looking in indexes: https://pypi.org/simple/
Collecting langchain
  Using cached langchain-0.3.26-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain_community
  Using cached langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting pypdf
  Using cached pypdf-5.7.0-py3-none-any.whl.metadata (7.2 kB)
Collecting langchain-core<1.0.0,>=0.3.66 (from langchain)
  Using cached langchain_core-0.3.68-py3-none-any.whl.metadata (5.8 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.8 (from langchain)
  Using cached langchain_text_splitters-0.3.8-py3-none-any.whl.metadata (1.9 kB)
Collecting langsmith>=0.1.17 (from langchain)
  Using cached langsmith-0.4.4-py3-none-any.whl.metadata (15 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Using cached sqlalchemy-2.0.41-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain_community)
  Using cached aiohttp-3.12.13-cp312-cp312-manylinux_2_17_x86_64.manylinux

With everything installed, we can split our PDF into chunks.

In [6]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("unsloth_documentation.pdf")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
)
split_docs = text_splitter.split_documents(docs)
print(split_docs[0])

page_content='Finetune Gemma 3n, Qwen3, Llama 4, Phi-4 & Mistral 2x faster with
80% less VRAM!
Finetune for Free
Notebooks are beginner friendly. Read our guide. Add your dataset, click “Run
All”, and export your finetuned model to GGUF, Ollama, vLLM or Hugging
Face.
Unsloth supports Free Notebooks Performance Memory use
Gemma 3n (4B) Start for free 1.5x faster 50% less
Qwen3 (14B) Start for free 2x faster 70% less
Qwen3 (4B):
GRPO
Start for free 2x faster 80% less
Gemma 3 (4B) Start for free 1.6x faster 60% less
Llama 3.2 (3B) Start for free 2x faster 70% less
Phi-4 (14B) Start for free 2x faster 70% less
Llama 3.2 Vision
(11B)
Start for free 2x faster 50% less
Llama 3.1 (8B) Start for free 2x faster 70% less
Mistral v0.3 (7B) Start for free 2.2x faster 75% less
Orpheus-TTS
(3B)
Start for free 1.5x faster 50% less
• See all our notebooks for: Kaggle, GRPO,TTS & Vision
• See all our models and all our notebooks
• See detailed documentation for Unsloth here
Quickstart' metadata={'produc
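
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not LangChain's algorithm (RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that only shows how consecutive chunks share their boundary characters.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample_text = "abcdefghij" * 250  # 2,500 characters of placeholder text
chunks = chunk_text(sample_text)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.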

Now that we have the PDF chunks, we can use Ollama to generate question and answer pairs for fine-tuning.

In [28]:
import json
from typing import List 
from pydantic import BaseModel
import ollama

def prompt_template(data: str, num_records: int = 5):

    return f"""You are an expert data curator assisting a machine learning engineer in creating a high-quality instruction tuning dataset. Your task is to transform 
    the provided data chunk into diverse question and answer (Q&A) pairs that will be used to fine-tune a language model. 

    For each of the {num_records} entries, generate one or two well-structured questions that reflect different aspects of the information in the chunk. 
    Ensure a mix of longer and shorter questions, with shorter ones typically containing 1-2 sentences and longer ones spanning up to 3-4 sentences. Each 
    Q&A pair should be concise yet informative, capturing key insights from the data.

    Structure your output in JSON format, where each object contains 'question' and 'answer' fields. The JSON structure should look like this:

        "question": "Your question here...",
        "answer": "Your answer here..."

    Focus on creating clear, relevant, and varied questions that encourage the model to learn from diverse perspectives. Avoid any sensitive or biased 
    content, ensuring answers are accurate and neutral.

    Example:
    
        "question": "What is the primary purpose of this dataset?",
        "answer": "This dataset serves as training data for fine-tuning a language model."
    

    By following these guidelines, you'll contribute to a robust and effective dataset that enhances the model's performance.
    
    Data
    {data}
    """

class Record(BaseModel):
    question: str
    answer: str

class Response(BaseModel):
    generated: List[Record]

def llm_call(data: str, num_records: int = 5) -> dict:
    response = ollama.generate(
        model="llama3.1",
        prompt=prompt_template(data, num_records),
        options={
            "num_predict": 2000
        },
        format=Response.model_json_schema(),
    )
    return json.loads(response['response'])

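dataset = []
for chunk in split_docs:
    # Pass only the chunk text; formatting the Document object itself
    # would also inject its metadata into the prompt.
    data = llm_call(chunk.page_content)
    print(data)
    for pair in data['generated']:
        print(pair)
        dataset.append({
            'question': pair['question'],
            'answer': pair['answer']
        })

# Write all generated Q&A pairs to a single JSON file for training.
tuning_data = 'unsloth_data.json'
with open(tuning_data, 'w') as f:
    json.dump(dataset, f)

print(f"Done writing {tuning_data} to system")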

{'generated': [{'question': "What are the benefits of using Unsloth's Notebooks for fine-tuning language models?", 'answer': "Unsloth's Notebooks offer a faster and more memory-efficient way to finetune language models, with some models being 2x faster and requiring up to 80% less VRAM."}, {'question': 'Which Unsloth Notebooks support free usage for beginners?', 'answer': "Unsloth's Notebooks are beginner-friendly, and users can start using them for free, allowing them to finetune their models without any additional cost."}, {'question': 'What is the difference in performance between Gemma 3n and Qwen3 when it comes to fine-tuning language models?', 'answer': 'Gemma 3n offers 1.5x faster performance compared to Qwen3, while also requiring 50% less VRAM.'}, {'question': "Can I export my finetuned model to various platforms using Unsloth's Notebooks?", 'answer': "Yes, users can export their finetuned models to GGUF, Ollama, vLLM, or Hugging Face using Unsloth's Notebooks."}, {'question':
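
Generated pairs can contain empty fields or repeated questions, so a light cleanup pass before training may help. The following is a minimal sketch under that assumption; `clean_pairs` is a hypothetical helper and the sample records are made up for illustration.

```python
def clean_pairs(pairs: list[dict]) -> list[dict]:
    """Drop pairs with empty fields and questions that repeat (case-insensitive)."""
    seen = set()
    cleaned = []
    for pair in pairs:
        question = str(pair.get("question", "")).strip()
        answer = str(pair.get("answer", "")).strip()
        if not question or not answer:
            continue  # skip incomplete records
        key = question.lower()
        if key in seen:
            continue  # skip duplicate questions
        seen.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned


# Made-up sample records, only to exercise the helper.
raw = [
    {"question": "What is Unsloth?", "answer": "A library for faster fine-tuning."},
    {"question": "what is unsloth?", "answer": "Duplicate question."},
    {"question": "Empty answer?", "answer": "   "},
]
cleaned = clean_pairs(raw)
print(len(raw), "->", len(cleaned))
```

A pass like this could run on `dataset` just before it is written to unsloth_data.json.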

Now that we have a set of training data, we are ready to train!

Before doing so, however, we need to install Unsloth, which speeds up training and reduces the memory (VRAM) it requires.

In [3]:
!uv pip install unsloth vllm

Using Python 3.12.3 environment at: venv
Resolved 85 packages in 372ms
Prepared 41 packages in 43.07s
Installed 61 packages in 176ms
 + accelerate==1.8.1
 + bitsandbytes==0.46.1
 + cut-cross-entropy==25.1.1
 + datasets==3.6.0
 + diffusers==0.34.0
 + dill==0.3.8
 + docstring-parser==0.16
 + filelock==3.18.0
 + fsspec==2025.3.0
 + hf-transfer==0.1.9
 + hf-xet==1.1.5
 + huggingface-hub==0.33.2
 + importlib-metadata==8.7.0
 + markdown-

You will also need PyTorch, along with the optional xformers, flash-attn, and flashinfer acceleration packages.

In [1]:
!pip install torch torchvision torchaudio
!uv pip install -U xformers
!uv pip install flash-attn --no-build-isolation
!uv pip install flashinfer-python

Looking in indexes: https://download.pytorch.org/whl/nightly/cu128


Validate that the GPU device is found.

In [40]:
import torch
use_cuda = torch.cuda.is_available()
print("CUDA available:", use_cuda)
if use_cuda:
    device = torch.device("cuda")
    print(f"Using device: {device}")
    print(torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("Using device: CPU")

CUDA available: True
Using device: cuda
NVIDIA GeForce RTX 5070


Now let's set up our model and tokenizer using Unsloth, format the dataset, and attach the LoRA adapters.

In [3]:
from unsloth import FastLanguageModel

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None
)

prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful, honest and harmless assistant designed to help engineers. Think through each question logically and provide an answer. Don't make things up, if you're unable to answer a question advise the user that you're unable to answer as it is outside of your scope.

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    inputs       = examples["question"]
    outputs      = examples["answer"]
    texts = []
    for input, output in zip(inputs, outputs):
        text = prompt_style.format(input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("json", split='train', data_files='unsloth_data.json')
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)
print(dataset["text"][0])

model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=256,
    lora_dropout=0, 
    bias="none", 
   
    use_gradient_checkpointing="unsloth", 
    random_state=3407,
    use_rslora=False, 
    loftq_config=None,
)

==((====))==  Unsloth 2025.6.12: Fast Llama patching. Transformers: 4.53.1. vLLM: 0.9.0.
   \\   /|    NVIDIA GeForce RTX 5070. Num GPUs = 1. Max memory: 11.94 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Map:   0%|          | 0/117 [00:00<?, ? examples/s]

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful, honest and harmless assistant designed to help engineers. Think through each question logically and provide an answer. Don't make things up, if you're unable to answer a question advise the user that you're unable to answer as it is outside of your scope.

### Input:
What are the models that can be fine-tuned with the provided data?

### Response:
The data supports fine-tuning of Gemma 3n, Qwen3, Llama 4, Phi-4 & Mistral<|end_of_text|>


Unsloth 2025.6.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
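
The `get_peft_model` call above uses `r=128` on seven projection modules. A LoRA adapter on a linear layer adds two matrices of shape (in, r) and (r, out), so it contributes r * (in + out) trainable parameters. The sketch below works through that arithmetic. The layer dimensions are assumptions taken from the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), not values read from this notebook.

```python
# Assumed Llama 3.1 8B dimensions (from the published model configuration).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
NUM_LAYERS = 32

# Values from the get_peft_model call above.
R = 128
LORA_ALPHA = 256

# (in_features, out_features) for each targeted module in one layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (in, r) and B is (r, out)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # standard LoRA scaling when use_rslora is False

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the total comes to 335,544,320, which matches the "Trainable parameters" figure in the training log further down.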


Now that the model, tokenizer, and formatted dataset are ready, we can configure the trainer for fine-tuning.

In [4]:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    num_train_epochs = 10,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 25,  # note: the training log shown below comes from a run with 100 total steps
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        report_to="none",
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)



Unsloth: Tokenizing ["text"]:   0%|          | 0/117 [00:00<?, ? examples/s]
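
The trainer is configured with `warmup_steps = 5`, `learning_rate = 2e-4`, and `lr_scheduler_type = "linear"`. As a rough sketch of what that schedule looks like, the function below ramps the rate up linearly during warmup and then decays it linearly to zero. `linear_lr` is a hypothetical helper that mirrors the usual shape of a linear schedule with warmup; the exact values used by the library may differ by version.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 25) -> float:
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: rise from 0 to base_lr over warmup_steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 1, 5, 15, 25):
    print(step, f"{linear_lr(step):.2e}")
```

With only a handful of warmup steps, the model sees the full learning rate almost immediately, which fits the steep early drop in training loss.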

The next step is to run the training.

In [5]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 117 | Num Epochs = 13 | Total steps = 100
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 335,544,320 of 8,000,000,000 (4.19% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,3.1092
2,3.1641
3,2.6256
4,1.736
5,1.1455
6,0.8678
7,0.8241
8,0.9402
9,0.7582
10,0.6184


TrainOutput(global_step=100, training_loss=0.2667884164303541, metrics={'train_runtime': 268.0753, 'train_samples_per_second': 5.968, 'train_steps_per_second': 0.373, 'total_flos': 1.1018697183535104e+16, 'train_loss': 0.2667884164303541})
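
The numbers in the training log can be cross-checked with some quick arithmetic. This sketch uses only values printed above (117 examples, batch size 4, 4 gradient accumulation steps, 1 GPU, 100 total steps, 268.0753 seconds of runtime); the rounding-up of partial batches is an assumption about how steps per epoch are counted.

```python
import math

# Values taken from the training log above.
num_examples = 117
per_device_batch = 4
grad_accum = 4
num_gpus = 1
total_steps = 100
train_runtime_s = 268.0753

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus
# Optimizer steps needed to see the dataset once (partial batch rounded up).
steps_per_epoch = math.ceil(num_examples / effective_batch)
# Epochs needed to reach the requested number of steps.
epochs_needed = math.ceil(total_steps / steps_per_epoch)

samples_per_second = total_steps * effective_batch / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print("effective batch size:", effective_batch)
print("steps per epoch:", steps_per_epoch)
print("epochs needed:", epochs_needed)
print("samples/s:", round(samples_per_second, 3))
print("steps/s:", round(steps_per_second, 3))
```

These results line up with the log: a total batch size of 16, 13 epochs, about 5.968 samples per second, and about 0.373 steps per second. With 117 examples seen roughly 13 times, overfitting to the generated Q&A pairs is a real possibility.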

Save the new model

In [6]:
new_model_local = "Llama-3.1-8B-unsloth"
model.save_pretrained(new_model_local) # Local saving
tokenizer.save_pretrained(new_model_local) # Local saving

('Llama-3.1-8B-unsloth/tokenizer_config.json',
 'Llama-3.1-8B-unsloth/special_tokens_map.json',
 'Llama-3.1-8B-unsloth/tokenizer.json')

Host the model in the local Ollama instance

In [9]:
!wget https://raw.githubusercontent.com/Brian-McGinn/Fine-Tuning-Tutorial/main/Modelfile
!ollama create unsloth-trained

gathering model components

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

copying file sha256:10bf4fc984f53fc830904de79d5d88d2ed020951f0cfb9d7471d0b594de7d1af 100%
copying file sha256:32f404d626cf7b1b6eea36c241ae7cbd6ec29c777a8701ea48c0fe9ebb94c9b1 100%
copying file sha256:52716f60c3ad328509fa37cdded9a2f1196ecae463f5480f5d38c66a25e7a7dc 100%
copying file sha256:43ebb808e1ff9c5db32e112d1213e0ce0b78162a6eb2436aeee361cb88ae847b 0%

In [10]:
# Confirm ollama loaded the model correctly
!ollama list

NAME                         ID              SIZE      MODIFIED       
unsloth-trained:latest       45b921c735c8    5.6 GB    38 seconds ago    
unsloth-trained-v2:latest    dc225e3f3634    5.6 GB    47 hours ago      
tm1-trained:latest           22f030dbb978    5.6 GB    2 days ago        
qwen2.5:14b                  7cdf5a0187d5    9.0 GB    2 days ago        
llama3.1:latest              46e0c10c039e    4.9 GB    3 days ago        
deepseek-r1:1.5b             e0979632db5a    1.1 GB    4 days ago        




Now let's do some cleanup to free GPU memory.

In [None]:
import gc
gc.collect()
import torch
torch.cuda.empty_cache()

After the cleanup, the Ollama server needs to be restarted.

In [None]:
%xterm
 # ollama serve & 

Call the newly created model from Ollama

In [11]:
import ollama
response = ollama.chat(
    model="unsloth-trained",
    messages=[
        {"role": "user", "content": "Give me some Unsloth.ai News?"},
    ],
)
print(response["message"]["content"])

Here are the latest updates from Unsloth:

### Model Updates:
Unsloth has released a new model, Llama 3.2 (8B), which offers improved performance and capabilities.

### Featured Articles:
Check out these articles for in-depth information on topics like AI safety, ethics, and cutting-edge research:

1. "Philosophical Foundations of Deep Learning: A Review" - This article explores the philosophical underpinnings of deep learning and its potential implications.
2. "The Ethics of Creating Life: A Discussion on Llama 3.2 (7B) and Beyond" - This piece delves into the ethical considerations surrounding the development of advanced AI models like Llama 3.2 (7B).

### Community News:
Join the conversation by participating in discussions related to these topics:

1. What are some potential applications of Llama 3.2 (8B) in industries like healthcare and finance?
2. How can users contribute to the development of Unsloth's models through their own projects and research?
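
One quick sanity check on a reply like this is to compare the model names and sizes it mentions against the ones listed in the documentation. The sketch below does that with a regular expression. `unsupported_mentions` is a hypothetical helper, the `DOCUMENTED` set is copied by hand from the documentation chunk printed earlier, and the reply text is abbreviated from the output above.

```python
import re

# Llama model/size pairs listed in the documentation chunk printed earlier.
DOCUMENTED = {("Llama 3.2", "3B"), ("Llama 3.1", "8B")}


def unsupported_mentions(reply: str, documented: set) -> set:
    """Return (model, size) pairs named in the reply that the documentation does not list."""
    mentions = set(re.findall(r"(Llama \d+(?:\.\d+)?) \((\d+B)\)", reply))
    return mentions - documented


# Abbreviated from the fine-tuned model's reply above.
reply = (
    "Unsloth has released a new model, Llama 3.2 (8B), which offers improved performance. "
    "A Discussion on Llama 3.2 (7B) and Beyond."
)
unsupported = sorted(unsupported_mentions(reply, DOCUMENTED))
print(unsupported)
```

Both sizes the reply gives for Llama 3.2 are flagged, since the documentation chunk only lists a 3B variant.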


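The results are inconsistent (the reply refers to Llama 3.2 as both an 8B and a 7B model), so let's try retrieval-augmented generation (RAG) instead and supply the documentation directly in the prompt.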

In [17]:
!pip install pypdf



Looking in indexes: https://pypi.org/simple/


In [19]:
import ollama
from pypdf import PdfReader
import os

# Load and extract text from the PDF
def load_pdf_content(pdf_path):
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    
    reader = PdfReader(pdf_path)
    text_content = ""
    for page in reader.pages:
        text_content += page.extract_text() + "\n"
    return text_content

# Load the unsloth documentation
pdf_content = load_pdf_content("unsloth_documentation.pdf")

# Create the RAG-enhanced prompt
rag_prompt = f"""Based on the following documentation about Unsloth.ai, please answer the user's question:

Documentation:
{pdf_content}

User Question: Give me some Unsloth.ai News?

Please provide a comprehensive answer based on the documentation provided."""

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": rag_prompt},
    ],
)
print(response["message"]["content"])

Based on the provided documentation, here's a comprehensive summary of Unsloth.ai news:

**Recent Developments**

* The Unsloth team has made significant progress in optimizing language models for long context finetuning workloads. Their solution, called Unsloth, allows users to fine-tune large language models with minimal computational resources.
* Unsloth has been integrated with Hugging Face's Transformers Library (TRL) and supports a wide range of language models, including Llama-3.1, Meta-Llama-3.2, and Phi-3.5.

**Performance Benchmarks**

* The team has conducted extensive benchmarking tests using the Alpaca Dataset and reported significant improvements in performance compared to Hugging Face's implementation.
* Unsloth achieves a 2x speedup and reduces VRAM usage by >75% while maintaining comparable results to Hugging Face's implementation.
* The team has also released detailed benchmarks for Llama-3.3 (70B) on the Alpaca Dataset, demonstrating its ability to handle long contex
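
The cell above places the entire PDF text in the prompt. A retrieval step would instead select only the chunks most relevant to the question. The following is a minimal sketch of that idea using plain word overlap as the relevance score, which is a stand-in for the embedding similarity a real RAG pipeline would normally use. All function names and the sample chunks are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, chunk: str) -> int:
    """Count how often the query's distinct words occur in the chunk."""
    query_words = set(tokenize(query))
    counts = Counter(tokenize(chunk))
    return sum(counts[word] for word in query_words)


def top_k_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 3) -> str:
    """Assemble a prompt that contains only the retrieved chunks."""
    context = "\n\n".join(top_k_chunks(query, chunks, k))
    return f"Documentation:\n{context}\n\nUser Question: {query}"


# Made-up sample chunks, only to exercise the functions.
sample_chunks = [
    "Unsloth supports Llama 3.1 and Qwen3 fine-tuning.",
    "Export your model to GGUF or Ollama.",
    "Unrelated text about cooking pasta.",
]
prompt = build_prompt("How do I export to Ollama?", sample_chunks, k=1)
print(prompt)
```

In this notebook the chunk texts could come from `split_docs` (for example `[d.page_content for d in split_docs]`), which would keep the prompt short enough to fit comfortably in the model's context window.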