Initial query

In [10]:
import requests
import json

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Give me some Unsloth.ai News?"}],
    "stream": False
}
response = requests.post(url, data=json.dumps(payload))
print(response.json())

{'model': 'llama3.1', 'created_at': '2025-07-05T18:12:06.121178506Z', 'message': {'role': 'assistant', 'content': 'I couldn\'t find any information on "unsloth". It\'s possible that it\'s a made-up word or not a widely recognized term.\n\nHowever, I\'m assuming you may be thinking of sloth, which is a slow-moving mammal that lives in the tropical rainforests of Central and South America. If that\'s the case, here\'s 100 words about sloths: Sloths are arboreal mammals that spend most of their time hanging upside down from trees. They have a low metabolism, which means they don\'t need to eat much or move around much, allowing them to conserve energy in their dense rainforest habitats.'}, 'done_reason': 'stop', 'done': True, 'total_duration': 1458366621, 'load_duration': 19558549, 'prompt_eval_count': 22, 'prompt_eval_duration': 116783676, 'eval_count': 128, 'eval_duration': 1321612281}
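The raw response carries timing fields alongside the message. As a minimal sketch, assuming the durations are reported in nanoseconds (as the Ollama API documentation describes), the helper below computes generation speed from a response shaped like the one above. The name `tokens_per_second` is illustrative, and the sample dict is a trimmed copy of that output with the reply text replaced by a placeholder.

```python
def tokens_per_second(resp: dict) -> float:
    """Generation speed from an Ollama chat response (durations assumed to be in ns)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Trimmed copy of the response printed above; the reply text is a placeholder.
sample = {
    "message": {"role": "assistant", "content": "(reply text)"},
    "eval_count": 128,
    "eval_duration": 1321612281,
}

print(sample["message"]["content"])
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Reading `response.json()["message"]["content"]` directly is usually easier than printing the whole dict.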


Now let's do the same thing with the ollama Python library.

First we need to install ollama with pip.

In [None]:
!pip install ollama

Next, let's create a simple Python query using the ollama library.

In [8]:
import ollama
response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Give me some Unsloth.ai News?"},
    ],
)
print(response["message"]["content"])

I'm a large language model, I don't have any information about "Unsloth.ai" as it doesn't seem to be a real or well-known entity. It's possible that you may be thinking of Sloth AI, which is a hypothetical concept rather than an actual AI system.

However, I can provide some information on sloths and their biology if you'd like!

If you meant something else entirely, please let me know what Unsloth.ai is supposed to be, or give me any additional context about it.


As you can see, Unsloth is an unknown term to the default llama3.1 model, and the results can be quite humorous. We want to fix this by fine-tuning the model on information about the Unsloth library.
The first step is to convert the provided unsloth_documentation.pdf into a dataset for training.
For this task we can use Docling and LiteLLM.
To help visualize different outputs, we'll also install colorama to color-code our terminal outputs.

In [None]:
!pip install docling litellm colorama

Most likely you will want to run this on a GPU, so you'll need to install PyTorch as well. I used the latest version with CUDA 12.8 since I have a 5000 series GPU, which is not supported by earlier CUDA versions. Please check version [compatibility](https://developer.nvidia.com/cuda-gpus).

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

You might also need to update the Hugging Face cache permissions and pre-download the docling-models.

In [None]:
!sudo chown -R $(whoami) ~/.cache/huggingface
!huggingface-cli download ds4sd/docling-models

Now we can begin extracting our data chunks from the PDF and saving them to a list.

In [3]:
import warnings
warnings.filterwarnings('ignore')

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

converter = DocumentConverter()
doc = converter.convert("unsloth_documentation.pdf").document
chunker = HybridChunker()
chunks = chunker.chunk(dl_doc=doc)

contextualized_chunks = []
for i, chunk in enumerate(chunks): 
    # print( f"Raw Text:\n{chunk.text[:300]}…" )
    contextualized_chunks.append(chunker.contextualize(chunk=chunk))
    # print(f"Contextualized Text:\n{contextualized_chunks[i][:300]}…")

Token indices sequence length is longer than the specified maximum sequence length for this model (991 > 512). Running this sequence through the model will result in indexing errors
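The warning above reports a sequence of 991 tokens against a 512-token limit for the tokenizer in use. A rough way to spot long chunks before generating data is sketched below. It counts whitespace-separated words as a stand-in for tokens, which is only an approximation, and `find_long_chunks` and the 400-word budget are illustrative choices rather than anything provided by Docling.

```python
def find_long_chunks(chunks: list[str], max_words: int = 400) -> list[tuple[int, int]]:
    """Return (index, word_count) for every chunk longer than max_words."""
    long_chunks = []
    for index, text in enumerate(chunks):
        word_count = len(text.split())  # crude proxy for the token count
        if word_count > max_words:
            long_chunks.append((index, word_count))
    return long_chunks

# Toy input; in the notebook this would be contextualized_chunks.
sample_chunks = ["short chunk", "word " * 500]
print(find_long_chunks(sample_chunks))
```

Chunks flagged this way are worth inspecting by hand before they are sent to the model.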


Now that we have contextualized chunks, we can use Ollama to generate data for our fine-tuning.

In [4]:
import json
from typing import List 
from pydantic import BaseModel
from litellm import completion

def prompt_template(data: str, num_records: int = 5):

    return f"""You are an expert data curator assisting a machine learning engineer in creating a high-quality instruction tuning dataset. Your task is to transform 
    the provided data chunk into diverse question and answer (Q&A) pairs that will be used to fine-tune a language model. 

    For each of the {num_records} entries, generate one or two well-structured questions that reflect different aspects of the information in the chunk. 
    Ensure a mix of longer and shorter questions, with shorter ones typically containing 1-2 sentences and longer ones spanning up to 3-4 sentences. Each 
    Q&A pair should be concise yet informative, capturing key insights from the data.

    Structure your output in JSON format, where each object contains 'question' and 'answer' fields. The JSON structure should look like this:

        "question": "Your question here...",
        "answer": "Your answer here..."

    Focus on creating clear, relevant, and varied questions that encourage the model to learn from diverse perspectives. Avoid any sensitive or biased 
    content, ensuring answers are accurate and neutral.

    Example:
    
        "question": "What is the primary purpose of this dataset?",
        "answer": "This dataset serves as training data for fine-tuning a language model."
    

    By following these guidelines, you'll contribute to a robust and effective dataset that enhances the model's performance.

    ---

    **Explanation:**

    - **Clarity and Specificity:** The revised prompt clearly defines the role of the assistant and the importance of the task, ensuring alignment with the 
    project goals.
    - **Quality Standards:** It emphasizes the need for well-formulated Q&A pairs, specifying the structure and content of each question and answer.
    - **Output Format:** An example JSON structure is provided to guide the format accurately.
    - **Constraints and Biases:** A note on avoiding sensitive or biased content ensures ethical considerations are met.
    - **Step-by-Step Guidance:** The prompt breaks down the task into manageable steps, making it easier for the assistant to follow.

    This approach ensures that the generated data is both high-quality and meets the specific requirements of the machine learning project.
    
    Data
    {data}
    """

class Record(BaseModel):
    question: str
    answer: str

class Response(BaseModel):
    generated: List[Record]

def llm_call(data: str, num_records: int = 5) -> dict:
    stream = completion(
        model="ollama/llama3.1",
        messages=[
            {
                "role": "user",
                "content": prompt_template(data, num_records),
            }
        ],
        stream=True,
        options={"num_predict": 2000},
        format=Response.model_json_schema(),
    )
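    # Collect the streamed deltas into one string, then parse it as JSON.
    # A separate name avoids overwriting the `data` argument.
    raw = ""
    for x in stream:
        delta = x['choices'][0]["delta"]["content"]
        if delta is not None:
            # print(delta, end="")
            raw += delta
    return json.loads(raw)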

dataset = []
for i, chunk in enumerate(contextualized_chunks):
    data = llm_call(chunk)
    for pair in data['generated']:
        print(pair)
        dataset.append({
                'question': pair['question'],
                'answer': pair['answer']
            })
tuning_data = 'unsloth_data.json'
with open(tuning_data,'w') as f: 
    json.dump(dataset, f) 

print(f"Done writing {tuning_data} to system")

{'question': 'What are the models that can be fine-tuned with the provided data?', 'answer': 'The data supports fine-tuning of Gemma 3n, Qwen3, Llama 4, Phi-4 & Mistral'}
{'question': 'How will these models benefit from using this dataset?', 'answer': 'These models can be finetuned with the provided data 2x faster and require 80% less VRAM.'}
{'question': 'What are the key differences between Gemma 3n and Qwen3 in terms of performance?', 'answer': 'Gemma 3n (4B) has a 1.5x faster performance compared to Qwen3, while Qwen3 (14B) has a 2x faster performance.'}
{'question': 'Which model uses the least amount of memory among all the options provided?', 'answer': 'Qwen3 (4B): GRPO and Gemma 3 (4B) use 50% and 60% less memory, respectively, but Qwen3 (14B): GRPO uses 80% less memory.'}
{'question': 'Are there any free notebooks available for the models mentioned in this data chunk?', 'answer': 'Yes, most of the models mentioned offer free notebooks, including Gemma 3n, Qwen3, and Llama 3.2 (
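`json.loads` in `llm_call` raises an exception whenever the model's output is not complete JSON, which can happen if generation stops at the `num_predict` limit. The sketch below shows one defensive way to handle that: return an empty list for unparseable output and drop records with missing fields. `parse_records` is a hypothetical helper, not part of LiteLLM or Ollama, and it uses only the standard library in place of the Pydantic models.

```python
import json

def parse_records(raw: str) -> list[dict]:
    """Parse model output into question/answer dicts, dropping anything malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # for example, output cut off before the JSON closed
    items = payload.get("generated") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []
    records = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("question"), item.get("answer")
        # Keep only pairs where both fields are non-empty strings.
        if isinstance(question, str) and isinstance(answer, str):
            if question.strip() and answer.strip():
                records.append({"question": question.strip(), "answer": answer.strip()})
    return records

good = '{"generated": [{"question": "Q?", "answer": "A."}, {"question": "", "answer": "x"}]}'
truncated = '{"generated": [{"question": "Q'
print(parse_records(good))
print(parse_records(truncated))
```

Skipping a bad chunk loses a few records but keeps a long generation loop from failing partway through.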

Now that we have a set of training data we are ready to train!

Before doing so, however, we need to install Unsloth to speed up training and reduce the memory it requires.

In [4]:
!pip install unsloth vllm

Looking in indexes: https://pypi.org/simple/
Collecting vllm
  Downloading vllm-0.9.1-cp38-abi3-manylinux1_x86_64.whl.metadata (15 kB)
Collecting torch<=2.7.0,>=2.4.0 (from unsloth)
  Using cached torch-2.7.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (29 kB)
Collecting cachetools (from vllm)
  Downloading cachetools-6.1.0-py3-none-any.whl.metadata (5.4 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting py-cpuinfo (from vllm)
  Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)
Collecting fastapi>=0.115.0 (from fastapi[standard]>=0.115.0->vllm)
  Downloading fastapi-0.115.14-py3-none-any.whl.metadata (27 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl.metadata (13 kB)
Collecting lm-format-enforcer<0.11,>=0.10.11 (from vllm)
  Downloading lm_format_enforcer-0.10.11-py3-none-any.whl

Now let's load the model and tokenizer with Unsloth, format the dataset with a prompt template, and attach LoRA adapters.

In [6]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None
)

prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful, honest and harmless assistant designed to help engineers. Think through each question logically and provide an answer. Don't make things up, if you're unable to answer a question advise the user that you're unable to answer as it is outside of your scope.

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    questions = examples["question"]
    answers   = examples["answer"]
    texts = []
    # Fill the template with each Q&A pair and append the EOS token.
    for question, answer in zip(questions, answers):
        text = prompt_style.format(question, answer) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
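# Load the Q&A file written earlier (adjust the path if you moved it).
dataset = load_dataset("json", data_files="unsloth_data.json", split="train")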
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)
print(dataset["text"][0])

model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=256,
    lora_dropout=0, 
    bias="none", 
   
    use_gradient_checkpointing="unsloth", 
    random_state=3407,
    use_rslora=False, 
    loftq_config=None,
)

==((====))==  Unsloth 2025.6.12: Fast Llama patching. Transformers: 4.53.1. vLLM: 0.9.1.
   \\   /|    NVIDIA GeForce RTX 5070. Num GPUs = 1. Max memory: 11.94 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 
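The ValueError above means the loader could not place the whole quantized model on the GPU. As a back-of-the-envelope check, the sketch below estimates the memory needed for the weights alone, using the 8 billion parameter count that Unsloth reports in the training log. It is a lower bound under simplifying assumptions: it ignores quantization metadata, the CUDA context, activations, and whatever other processes already hold GPU memory. `weight_memory_gib` is an illustrative name.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8_000_000_000  # parameter count as reported in the training log

for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

If the 4-bit estimate fits comfortably in the card's memory but loading still fails, it may be worth checking what else is occupying the GPU before retrying.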

Now that we have the model, tokenizer, and formatted dataset, we can set up the trainer for fine-tuning.

In [7]:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    num_train_epochs = 10,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/117 [00:00<?, ? examples/s]

The next step is to run the training and report how long it took and how much GPU memory it used.

In [8]:
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(torch.cuda.get_device_properties(0).total_memory / 1024 / 1024 / 1024, 3) 

trainer_stats = trainer.train()

# get performance
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 117 | Num Epochs = 25 | Total steps = 200
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 335,544,320 of 8,000,000,000 (4.19% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.5461
2,0.4728
3,0.0953
4,0.0586
5,0.0426
6,0.0245
7,0.0375
8,0.0003
9,0.0039
10,0.0001


549.5646 seconds used for training.
9.16 minutes used for training.
Peak reserved memory = 10.236 GB.
Peak reserved memory for training = 3.576 GB.
Peak reserved memory % of max memory = 85.729 %.
Peak reserved memory for training % of max memory = 29.95 %.
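The "Trainable parameters = 335,544,320" line in the log can be reproduced by hand. Each LoRA adapter on a linear layer with input size d_in and output size d_out adds r * (d_in + d_out) parameters. The sketch below applies that to the seven target modules, assuming the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of size 128, 32 layers); those dimensions are not stated in this notebook.

```python
# Assumed Llama 3.1 8B dimensions (from the published model config).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head size 128
NUM_LAYERS = 32

# (input size, output size) of each linear layer targeted by LoRA.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """Total LoRA parameters: rank * (d_in + d_out) per module, summed over all layers."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS

print(f"{lora_param_count(128):,}")
```

Because the count scales linearly with the rank, halving `r` to 64 would halve the number of trainable parameters.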


Save the new model

In [9]:
new_model_local = "Llama-3.1-8B-unsloth"
model.save_pretrained(new_model_local) # Local saving
tokenizer.save_pretrained(new_model_local) # Local saving

('Llama-3.1-8B-unsloth/tokenizer_config.json',
 'Llama-3.1-8B-unsloth/special_tokens_map.json',
 'Llama-3.1-8B-unsloth/tokenizer.json')

Host the model in local Ollama
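`ollama create` builds a model from a Modelfile, which this notebook does not show. As a sketch only, the snippet below renders one plausible Modelfile for a LoRA adapter, following the `FROM` plus `ADAPTER` pattern from Ollama's import documentation. Both values are assumptions: `llama3.1` is a placeholder for the base model (it should match the model the adapter was trained on), and the adapter path is the directory saved in the previous cell. `render_modelfile` is a hypothetical helper.

```python
def render_modelfile(base_model: str, adapter_dir: str) -> str:
    """Build Modelfile text: a base model plus a LoRA adapter directory."""
    return f"FROM {base_model}\nADAPTER {adapter_dir}\n"

# Assumed values: a placeholder base model tag and the directory saved above.
modelfile_text = render_modelfile("llama3.1", "./Llama-3.1-8B-unsloth")
print(modelfile_text)
# To use it, save this text to a file named "Modelfile" in the working directory.
```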

In [11]:
!ollama create unsloth-trained

gathering model components

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

gathering model components
copying file sha256:72cc873cb9a73ac730a3fd3680d4c2380d9ad7cb402bbafc8c05e0d441276675 0%
copying file sha256:36f0351f1afd665ae72623dc7e0d7f51272123f992e93ddda1cb8b625c062778 100%
copying file sha256:52716f60c3ad328509fa37cdded9a2f1196ecae463f5480f5d38c66a25e7a7dc 100%
copying file sha256:32f404d626cf7b1b6eea36c241ae7cbd6ec29c777a8701ea48c0fe9ebb94c9b1 100%
copying file sha256:889397283200148c6aabc316131bb48d62f4b547e4e53da6e13b31851f6a01a1 100%
gathering model components
copying file sha256:72cc873cb9a73ac730a3fd3680d4c2380d9ad7cb402bbafc8c05e0d441276675 5%
copying file sha256:36f0351f1afd665ae72623dc7e0d7f51272123f992e93ddda1cb8b625c062778 100%
copying file sha256:52716f60c3ad328509fa37cdded9a2f1196ecae463f5480f5d38c66a25e7a7dc 100

Call the new model from Ollama

In [12]:
import ollama
response = ollama.chat(
    model="unsloth-trained",
    messages=[
        {"role": "user", "content": "Give me some Unsloth.ai News?"},
    ],
)
print(response["message"]["content"])

I'm not aware of any recent news or updates from Unsloth.ai. As a conversational AI, I don't have direct access to the web and may not be up-to-date on the latest developments.

However, I can suggest some possible ways to find recent news or updates from Unsloth.ai:

1. Check their website: You can visit Unsloth.ai's website and check for any recent blog posts, announcements, or updates.
2. Social media: Unsloth.ai may have a social media presence on platforms like Twitter, LinkedIn, or Facebook. You can try searching for them on these platforms to see if they've posted any recent updates.
3. News articles: You can try searching online for news articles related to Unsloth.ai or their products.

If you're looking for specific information, I'd be happy to help with that!


The results are inconsistent, so let's try RAG. Here the full text of the PDF is placed directly in the prompt.

In [20]:
import ollama
from PyPDF2 import PdfReader
import os

# Load and extract text from the PDF
def load_pdf_content(pdf_path):
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    
    reader = PdfReader(pdf_path)
    text_content = ""
    for page in reader.pages:
        text_content += page.extract_text() + "\n"
    return text_content

# Load the unsloth documentation
pdf_content = load_pdf_content("unsloth_documentation.pdf")

# Create the RAG-enhanced prompt
rag_prompt = f"""Based on the following documentation about Unsloth.ai, please answer the user's question:

Documentation:
{pdf_content}

User Question: Give me some Unsloth.ai News?

Please provide a comprehensive answer based on the documentation provided."""

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": rag_prompt},
    ],
)
print(response["message"]["content"])

Based on the provided documentation, here are some key points and news related to Unsloth:

**News and Updates**

* Unsloth was tested using the Alpaca Dataset with a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down). The results showed that Unsloth achieves a significant VRAM reduction of >75% compared to Hugging Face + FA2.
* Benchmarking of Unsloth was also conducted by Hugging Face. They tested Llama 3.1 (8B) and Llama 3.3 (70B) with QLoRA on all linear layers and reported that Unsloth achieves a longer context length than Hugging Face + FA2.

**Recent Developments**

* Unsloth has added support for 4-bit quantization, which reduces memory usage while maintaining accuracy.
* The team has also implemented full finetuning capabilities, allowing users to fine-tune models from scratch.
* A new LoRA module called "unsloth" has been introduced, which uses 30% less VRAM and fits 2x larger batch sizes.

**Recen
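The cell above passes the entire PDF text to the model. A retrieval step would instead select only the chunks most relevant to the question before building the prompt. The sketch below illustrates the idea with plain keyword overlap, which is a deliberately simple stand-in for embedding-based search and is not what the notebook does. `top_k_chunks` is a hypothetical helper, and the toy chunks are made up for the example.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by how many distinct words they share with the question."""
    question_words = tokenize(question)
    # sorted() is stable, so chunks with equal scores keep their original order.
    ranked = sorted(chunks, key=lambda c: len(question_words & tokenize(c)), reverse=True)
    return ranked[:k]

# Toy chunks; in the notebook these could be the contextualized_chunks from Docling.
toy_chunks = [
    "Unsloth supports 4-bit quantization.",
    "Sloths live in rainforests.",
    "Install with pip.",
]
print(top_k_chunks("Does Unsloth support quantization?", toy_chunks, k=1))
```

The selected chunks would then replace `pdf_content` in the prompt, keeping it short enough to fit the model's context window.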