Initial query

In [None]:
import requests
import json

url = "http://localhost:11434/api/chat"
payload = {
    "model": "deepseek-r1:1.5b",
    "messages": [{"role": "user", "content": "Tell me about QLoRA in 100 words or less."}],
    "stream": False
}
response = requests.post(url, data=json.dumps(payload))
print(response.json())
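With `"stream": False`, Ollama returns the whole reply as a single JSON object, and the assistant's text sits under `message` → `content`. DeepSeek-R1 models may also place their reasoning inside `<think>...</think>` tags ahead of the answer. The sketch below pulls out just the answer text. `extract_answer` is a hypothetical helper, not part of any library, and `sample` is a made-up response body used only so the example runs without a server.

```python
import re

def extract_answer(body: dict) -> str:
    """Return the assistant text from a non-streaming /api/chat response body."""
    content = body["message"]["content"]
    # Drop an optional <think>...</think> reasoning block if one is present.
    content = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)
    return content.strip()

# Made-up response body for illustration only; real replies will differ.
sample = {
    "model": "deepseek-r1:1.5b",
    "message": {
        "role": "assistant",
        "content": "<think>\nThe user wants a short summary.\n</think>\n\n"
                   "QLoRA fine-tunes a quantized model with LoRA adapters.",
    },
    "done": True,
}
print(extract_answer(sample))
```

In the cell above, this would be called as `extract_answer(response.json())`.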

Now let's do the same thing with the ollama Python library.

First, we need to install ollama with pip:

In [None]:
!pip install ollama

Next, let's send a simple query using the ollama library:

In [None]:
import ollama
response = ollama.chat(
    model="deepseek-r1:1.5b",
    messages=[
        {"role": "user", "content": "Why is QLoRA such a popular method for fine tuning. Explain this in 100 words or less."},
    ],
)
print(response["message"]["content"])

As you can see, unsloth is an unknown term to the default DeepSeek model, and the results can be quite humorous. We want to fix this by fine-tuning the model on some information about the unsloth library.
The first step is to convert the provided unsloth_documentation.pdf into a dataset for training.
For this task we can use Docling and LiteLLM.
To help distinguish the different outputs, we'll also install colorama to color-code our terminal output.

In [None]:
!pip install docling litellm colorama

You will most likely want to run this on a GPU, so you'll need to install PyTorch as well. I used the latest version with CUDA 12.8 since I have a 5000 series GPU, which is not supported by earlier CUDA versions. Please check version [compatibility](https://developer.nvidia.com/cuda-gpus) for your own card.

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
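After the install, it may be worth confirming that PyTorch can actually see the GPU before running Docling. The following is a minimal sketch; `cuda_summary` is a hypothetical helper name, and it only uses standard `torch.cuda` queries. It falls back to a plain message if PyTorch is missing or no CUDA device is visible.

```python
def cuda_summary() -> str:
    """Describe the GPU that PyTorch can see, or explain why none is visible."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment."
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} is installed, but no CUDA device is visible."
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    return (
        f"PyTorch {torch.__version__} (built for CUDA {torch.version.cuda}) "
        f"sees {name}, compute capability {major}.{minor}."
    )

print(cuda_summary())
```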

You might also need to update the Hugging Face cache permissions and pre-download the docling-models.

In [None]:
# Shell commands: run these in a terminal, since sudo may prompt for a password
sudo chown -R $(whoami) ~/.cache/huggingface
huggingface-cli download ds4sd/docling-models

Now we can begin extracting our data chunks from the PDF and saving them to a list.

In [None]:
import warnings
warnings.filterwarnings('ignore')

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Parse the PDF into a Docling document
converter = DocumentConverter()
doc = converter.convert("unsloth_documentation.pdf").document

# Split the document into chunks
chunker = HybridChunker()
chunks = chunker.chunk(dl_doc=doc)

contextualized_chunks = []
for chunk in chunks:
    print(f"Raw Text:\n{chunk.text[:300]}…")
    # contextualize() returns the chunk text enriched with its document metadata
    contextualized = chunker.contextualize(chunk=chunk)
    contextualized_chunks.append(contextualized)
    print(f"Contextualized Text:\n{contextualized[:300]}…")
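Since the next step makes one LLM call per chunk, a quick look at how many chunks there are and how long they run can help estimate the cost before starting. This is a small sketch using only the standard library; `summarize_lengths` and `example_chunks` are hypothetical names, and the stand-in strings exist only so the snippet runs on its own.

```python
from statistics import mean

def summarize_lengths(texts: list[str]) -> dict:
    """Return the count and character-length statistics for a list of text chunks."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }

# Stand-in data so the sketch is self-contained; in the notebook,
# pass contextualized_chunks instead.
example_chunks = [
    "Installation\nRun pip install.",
    "Fine-tuning\nLoad the model, then train.",
]
print(summarize_lengths(example_chunks))
```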

Now that we have contextualized chunks, we can use ollama to generate data for our fine-tuning.

In [None]:
import json
from typing import List 
from pydantic import BaseModel
from litellm import completion

def prompt_template(data: str, num_records: int = 5):

    return f"""You are an expert data curator assisting a machine learning engineer in creating a high-quality instruction tuning dataset. Your task is to transform 
    the provided data chunk into diverse question and answer (Q&A) pairs that will be used to fine-tune a language model. 

    For each of the {num_records} entries, generate one or two well-structured questions that reflect different aspects of the information in the chunk. 
    Ensure a mix of longer and shorter questions, with shorter ones typically containing 1-2 sentences and longer ones spanning up to 3-4 sentences. Each 
    Q&A pair should be concise yet informative, capturing key insights from the data.

    Structure your output in JSON format, where each object contains 'question' and 'answer' fields. The JSON structure should look like this:

        {{
            "question": "Your question here...",
            "answer": "Your answer here..."
        }}

    Focus on creating clear, relevant, and varied questions that encourage the model to learn from diverse perspectives. Avoid any sensitive or biased 
    content, ensuring answers are accurate and neutral.

    Example:

        {{
            "question": "What is the primary purpose of this dataset?",
            "answer": "This dataset serves as training data for fine-tuning a language model."
        }}
    

    By following these guidelines, you'll contribute to a robust and effective dataset that enhances the model's performance.

    ---

    **Explanation:**

    - **Clarity and Specificity:** The revised prompt clearly defines the role of the assistant and the importance of the task, ensuring alignment with the 
    project goals.
    - **Quality Standards:** It emphasizes the need for well-formulated Q&A pairs, specifying the structure and content of each question and answer.
    - **Output Format:** An example JSON structure is provided to guide the format accurately.
    - **Constraints and Biases:** A note on avoiding sensitive or biased content ensures ethical considerations are met.
    - **Step-by-Step Guidance:** The prompt breaks down the task into manageable steps, making it easier for the assistant to follow.

    This approach ensures that the generated data is both high-quality and meets the specific requirements of the machine learning project.
    
    Data
    {data}
    """

class Record(BaseModel):
    question: str
    answer: str

class Response(BaseModel):
    generated: List[Record]

def llm_call(data: str, num_records: int = 5) -> dict:
    stream = completion(
        model="ollama_chat/llama3.1",
        messages=[
            {
                "role": "user",
                "content": prompt_template(data, num_records),
            }
        ],
        stream=True,
        options={"num_predict": 2000},
        format=Response.model_json_schema(),  # constrain output to the Response schema
    )
    # Accumulate the streamed tokens in their own variable instead of reusing `data`
    generated = ""
    for part in stream:
        delta = part['choices'][0]["delta"]["content"]
        if delta is not None:
            print(delta, end="")
            generated += delta
    return json.loads(generated)

dataset = {}
for i, chunk in enumerate(contextualized_chunks):
    data = llm_call(chunk)
    dataset[i] = {"generated":data["generated"], "context":chunk}

with open('unsloth_data.json','w') as f: 
    json.dump(dataset, f) 

{
  "generated": [
    {
      "question": "What are the benefits of using specific language models for fine-tuning?",
      "answer": "Using Gemma 3n, Qwen3, Llama 4, Phi-4 & Mistral results in a speed boost and significant VRAM reduction."
    },
    {
      "question": "Can you elaborate on how these specialized models improve performance?",
      "answer": "They enable fine-tuning up to 2x faster with 80% less VRAM requirements compared to the original setup."
    },
    {
      "question": "What are the implications of using these optimized language models for large-scale applications?",
      "answer": "By leveraging Gemma 3n, Qwen3, Llama 4, Phi-4 & Mistral, developers can achieve better performance and efficiency in their projects."
    },
    {
      "question": "How do the specifications of these language models compare to traditional models?",
      "answer": "These specialized models provide a unique combination of speed and memory efficiency not typically seen in standard 