# Lab 3.1.10: Ollama Integration - Deploy Your Fine-Tuned Model Locally

## The Finish Line 🏁

You've fine-tuned your model with LoRA, DoRA, DPO, or KTO. Now what? You need to **run it**!

**Ollama** makes running LLMs locally as simple as:
```bash
ollama run my-custom-model
```

### ELI5: What is Ollama?

Think of Ollama like **Docker for LLMs**:
- Docker packages applications → Ollama packages AI models
- `docker run nginx` → `ollama run llama3`
- Dockerfile → Modelfile
- Docker Hub → Ollama Library

### Why Ollama for Your Fine-Tuned Models?

| Benefit | Description |
|---------|-------------|
| **Simple Deployment** | No Python scripts, just run a command |
| **API Ready** | Built-in REST API, plus an OpenAI-compatible endpoint under `/v1` |
| **Efficient** | GGUF format, Metal/CUDA optimized |
| **Portable** | Works on any machine with Ollama installed |
| **Team Sharing** | Push to registry for team access |
| **Web UI Support** | Pair with a separate front end such as Open WebUI to test models interactively |

### What You'll Learn

1. **Convert** - Transform your fine-tuned model to GGUF format
2. **Package** - Create a Modelfile with your system prompt
3. **Deploy** - Run locally with one command
4. **Test** - Confirm the server responds at http://localhost:11434 and try the model interactively
5. **Integrate** - Use the API in your applications

### Prerequisites

- Completed fine-tuning from previous labs (or use our pre-trained example)
- Ollama installed (`curl -fsSL https://ollama.ai/install.sh | sh`)
- llama.cpp for GGUF conversion

---

## Section 1: Environment Setup

In [None]:
# DGX SPARK NOTE: These packages are pre-installed in the NGC PyTorch container.
# If running outside NGC container, install with: pip install transformers peft bitsandbytes accelerate requests
# IMPORTANT: Do NOT run 'pip install torch' on DGX Spark - use the NGC container instead.

# Verify required packages (should already be available in NGC container)
import importlib
required_packages = ['transformers', 'peft', 'bitsandbytes', 'accelerate', 'requests']
missing = []
for pkg in required_packages:
    try:
        importlib.import_module(pkg.replace('-', '_'))
    except ImportError:
        missing.append(pkg)

if missing:
    print(f"⚠️ Missing packages: {missing}")
    print("Install inside NGC container with: pip install " + " ".join(missing))
else:
    print("✅ All required packages are available")

# Check if Ollama is installed
import subprocess
result = subprocess.run(["which", "ollama"], capture_output=True, text=True)
if result.returncode == 0:
    print(f"✅ Ollama found: {result.stdout.strip()}")
else:
    print("⚠️ Ollama not found. Install with: curl -fsSL https://ollama.ai/install.sh | sh")

In [None]:
import os
import json
import requests
import subprocess
from pathlib import Path
from typing import Optional, Dict, Any

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

---

## Section 2: The Deployment Pipeline

### From Fine-Tuned Weights to Running Model

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                     FINE-TUNED MODEL DEPLOYMENT PIPELINE                     │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    │
│  │  Fine-Tuned │    │   Merged    │    │    GGUF     │    │   Ollama    │    │
│  │   Weights   │───▶│    Model    │───▶│   Format    │───▶│    Model    │    │
│  │  (adapter)  │    │ (full size) │    │ (quantized) │    │  (ready!)   │    │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    │
│         │                  │                  │                  │           │
│         ▼                  ▼                  ▼                  ▼           │
│   LoRA adapter        Base + LoRA        Compressed         ollama run       │
│    ~100-500MB           ~3-16GB            ~2-8GB            my-model        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

### Two Deployment Paths

| Path | Description | Use Case |
|------|-------------|----------|
| **Full Merge** | Merge LoRA → Convert to GGUF → Ollama | Maximum portability, any model |
| **GGUF LoRA** | Apply LoRA directly to GGUF base | Faster, smaller files |

---

## Section 3: Merge LoRA Adapters into Base Model

### Why Merge First?

LoRA adapters are "diffs" from the base model. For the Full Merge path, Ollama needs a complete, standalone model.

```
Base Model (7B params) + LoRA Adapter (20M params) = Merged Model (7B params)
                                                     (with fine-tuned weights)
```
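
To make the "diff" idea concrete, here is a minimal pure-Python sketch of what merging does to a single weight matrix. It assumes the standard LoRA formulation, `W' = W + (alpha / r) * B @ A`. The tiny matrices and the `alpha` and `r` values are invented for illustration, and `matmul` and `merge_lora` are hypothetical helpers; PEFT's `merge_and_unload()`, used in the function below, applies the same idea to every adapted layer.

```python
# Toy LoRA merge for one weight matrix, using plain Python lists.
# All numbers are invented for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]   # base weight, shape (2, 3)
A = [[1.0, 2.0, 3.0]]   # LoRA "down" matrix, shape (r=1, 3)
B = [[0.5],
     [-1.0]]            # LoRA "up" matrix, shape (2, r=1)

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # same shape as W: merging adds no parameters
```

The merged matrix has the same shape as the base weight, which is why the merged model keeps the base model's parameter count.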

In [None]:
def merge_lora_model(
    base_model_id: str,
    adapter_path: str,
    output_path: str,
    push_to_hub: bool = False,
    hub_model_id: Optional[str] = None
) -> None:
    """
    Merge LoRA adapter weights into base model.
    
    This creates a standalone model that doesn't need the adapter.
    """
    print(f"📥 Loading base model: {base_model_id}")
    
    # Load base model in FP16 for merging
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True
    )
    
    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    
    print(f"🔌 Loading adapter: {adapter_path}")
    
    # Load and merge LoRA weights
    model = PeftModel.from_pretrained(base_model, adapter_path)
    
    print("🔀 Merging weights...")
    merged_model = model.merge_and_unload()
    
    print(f"💾 Saving merged model to: {output_path}")
    merged_model.save_pretrained(output_path)
    tokenizer.save_pretrained(output_path)
    
    if push_to_hub and hub_model_id:
        print(f"🚀 Pushing to Hub: {hub_model_id}")
        merged_model.push_to_hub(hub_model_id)
        tokenizer.push_to_hub(hub_model_id)
    
    print("✅ Merge complete!")
    
    # Show size comparison
    adapter_size = sum(
        f.stat().st_size for f in Path(adapter_path).rglob('*') if f.is_file()
    ) / (1024**3)
    merged_size = sum(
        f.stat().st_size for f in Path(output_path).rglob('*') if f.is_file()
    ) / (1024**3)
    
    print(f"\n📊 Size comparison:")
    print(f"   Adapter: {adapter_size:.2f} GB")
    print(f"   Merged:  {merged_size:.2f} GB")

In [None]:
# Example: Merge a fine-tuned model
# (Uncomment and modify paths for your model)

# merge_lora_model(
#     base_model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
#     adapter_path="./fine-tuned-adapter",
#     output_path="./merged-model"
# )

---

## Section 4: Convert to GGUF Format

### What is GGUF?

GGUF (GPT-Generated Unified Format) is the format Ollama uses:
- **Single file** - All weights, config, and metadata in one `.gguf` file
- **Quantized** - Compressed from FP16 to Q4/Q8 (roughly 2-4x smaller)
- **Fast** - Optimized for CPU/GPU inference

### Quantization Options

| Format | Bits | Size (7B) | Quality | Speed | Use Case |
|--------|------|-----------|---------|-------|----------|
| Q2_K | 2-3 | ~3GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Extreme compression |
| Q4_K_M | 4 | ~4GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | **Recommended** |
| Q5_K_M | 5 | ~5GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Quality/size balance |
| Q8_0 | 8 | ~7GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Near-original quality |
| F16 | 16 | ~14GB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Full precision |
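
The sizes in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8. The sketch below uses the nominal bit widths from the table (2.5 is an assumed midpoint for Q2_K's "2-3"), and `estimate_size_gb` is a hypothetical helper. Real GGUF files usually come out somewhat larger, because quantized formats also store per-block scales and metadata.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in decimal GB: params * bits / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Nominal bit widths taken from the table above (Q2_K midpoint is an assumption).
NOMINAL_BITS = {"Q2_K": 2.5, "Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "F16": 16}

for name, bits in NOMINAL_BITS.items():
    print(f"{name:7s} ~{estimate_size_gb(7e9, bits):.1f} GB")
```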

In [None]:
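def setup_llama_cpp():
    """
    Clone and build llama.cpp for GGUF conversion.
    """
    import sys  # local import so this cell runs on its own
    
    llama_cpp_path = Path("./llama.cpp")
    
    if not llama_cpp_path.exists():
        print("📥 Cloning llama.cpp...")
        subprocess.run([
            "git", "clone", 
            "https://github.com/ggerganov/llama.cpp.git"
        ], check=True)
        
        # Recent llama.cpp versions build with CMake (the Makefile build was retired);
        # binaries such as llama-quantize land in build/bin/.
        print("🔨 Building llama.cpp...")
        subprocess.run(["cmake", "-B", "build"], cwd=llama_cpp_path, check=True)
        subprocess.run(
            ["cmake", "--build", "build", "--config", "Release", "-j"],
            cwd=llama_cpp_path, check=True
        )
    else:
        print("✅ llama.cpp already exists")
    
    # Install Python requirements for conversion into the current interpreter.
    # NOTE: this requirements file may pull in its own torch build; on DGX Spark,
    # review it first so it does not replace the NGC container's PyTorch.
    subprocess.run([
        sys.executable, "-m", "pip", "install", "-q", "-r", 
        str(llama_cpp_path / "requirements.txt")
    ], check=True)
    
    return llama_cpp_path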

In [None]:
def convert_to_gguf(
    model_path: str,
    output_path: str,
    quantization: str = "q4_k_m",
    llama_cpp_path: str = "./llama.cpp"
) -> str:
    """
    Convert HuggingFace model to GGUF format.
    
    Args:
        model_path: Path to merged HuggingFace model
        output_path: Directory for GGUF output
        quantization: Quantization level (q4_k_m, q5_k_m, q8_0, f16)
        llama_cpp_path: Path to llama.cpp directory
    
    Returns:
        Path to the GGUF file
    """
    import sys  # local import so this cell runs on its own
    
    llama_cpp = Path(llama_cpp_path)
    output_dir = Path(output_path)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Step 1: Convert to FP16 GGUF
    print("🔄 Converting to GGUF format...")
    fp16_path = output_dir / "model-f16.gguf"
    
    subprocess.run([
        sys.executable, str(llama_cpp / "convert_hf_to_gguf.py"),
        model_path,
        "--outfile", str(fp16_path),
        "--outtype", "f16"
    ], check=True)
    
    # Step 2: Quantize (if not f16)
    if quantization.lower() != "f16":
        print(f"📦 Quantizing to {quantization}...")
        quantized_path = output_dir / f"model-{quantization}.gguf"
        
        # CMake builds put the binary in build/bin; older Makefile builds left it in the repo root
        candidates = [llama_cpp / "build" / "bin" / "llama-quantize", llama_cpp / "llama-quantize"]
        quantize_bin = next((p for p in candidates if p.exists()), None)
        if quantize_bin is None:
            raise FileNotFoundError("llama-quantize not found; build llama.cpp first")
        
        subprocess.run([
            str(quantize_bin),
            str(fp16_path),
            str(quantized_path),
            quantization.upper()
        ], check=True)
        
        # Remove FP16 to save space
        fp16_path.unlink()
        final_path = quantized_path
    else:
        final_path = fp16_path
    
    # Show result
    size_gb = final_path.stat().st_size / (1024**3)
    print(f"\n✅ GGUF created: {final_path}")
    print(f"📏 Size: {size_gb:.2f} GB")
    
    return str(final_path)

In [None]:
# Example: Convert merged model to GGUF
# (Uncomment after merging)

# llama_cpp_path = setup_llama_cpp()
# gguf_path = convert_to_gguf(
#     model_path="./merged-model",
#     output_path="./gguf-output",
#     quantization="q4_k_m"
# )
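
After conversion, a quick sanity check is to read the file's header. The sketch below assumes the GGUF v2 or later layout: a 4-byte `GGUF` magic, a little-endian uint32 version, then uint64 tensor and metadata counts. `read_gguf_header` is a hypothetical helper, and the demo writes a fake header to a temporary file so it runs without a real model.

```python
import os
import struct
import tempfile

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, tensor and metadata counts."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ" = little-endian uint32 + two uint64 values (20 bytes after the magic)
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}

# Demo on a fake header (made-up counts) so the sketch runs offline.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 201, 24))
info = read_gguf_header(tmp.name)
os.remove(tmp.name)
print(info)
```

For a real file, call `read_gguf_header(gguf_path)` with the path returned by `convert_to_gguf`.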

---

## Section 5: Create Ollama Modelfile

### What is a Modelfile?

A Modelfile is like a Dockerfile for LLMs:

```dockerfile
# Dockerfile           # Modelfile
FROM python:3.11       FROM ./model.gguf
COPY . /app            SYSTEM "You are a helpful assistant"
CMD ["python", "app"]  PARAMETER temperature 0.7
```

### Modelfile Components

| Directive | Description | Example |
|-----------|-------------|---------|
| FROM | Base model or GGUF path | `FROM ./model.gguf` |
| SYSTEM | System prompt | `SYSTEM "You are a coding assistant"` |
| TEMPLATE | Chat template | `TEMPLATE "{{ .System }}..."` |
| PARAMETER | Model parameters | `PARAMETER temperature 0.7` |
| LICENSE | License info | `LICENSE "MIT"` |

In [None]:
def create_modelfile(
    gguf_path: str,
    model_name: str,
    system_prompt: str,
    template: Optional[str] = None,
    parameters: Optional[Dict[str, Any]] = None,
    output_dir: str = "."
) -> str:
    """
    Create an Ollama Modelfile for your custom model.
    
    Args:
        gguf_path: Path to the GGUF model file
        model_name: Name for your model in Ollama
        system_prompt: Default system prompt
        template: Optional chat template (auto-detected if None)
        parameters: Optional parameters (temperature, top_p, etc.)
        output_dir: Where to save the Modelfile
    
    Returns:
        Path to the Modelfile
    """
    lines = []
    
    # Base model (absolute path, so it resolves no matter where the Modelfile is saved)
    lines.append(f"FROM {Path(gguf_path).resolve()}")
    lines.append("")
    
    # System prompt
    lines.append(f'SYSTEM """{system_prompt}"""')
    lines.append("")
    
    # Chat template (for Llama 3 style models)
    if template:
        lines.append(f'TEMPLATE """{template}"""')
        lines.append("")
    
    # Parameters
    if parameters:
        for key, value in parameters.items():
            lines.append(f"PARAMETER {key} {value}")
        lines.append("")
    
    # Default parameters if none specified
    else:
        lines.extend([
            "PARAMETER temperature 0.7",
            "PARAMETER top_p 0.9",
            "PARAMETER top_k 40",
            "PARAMETER num_ctx 4096",
            ""
        ])
    
    # Write Modelfile
    modelfile_path = Path(output_dir) / "Modelfile"
    modelfile_path.write_text("\n".join(lines))
    
    print(f"📝 Created Modelfile: {modelfile_path}")
    print("\nContents:")
    print("-" * 50)
    print("\n".join(lines))
    print("-" * 50)
    
    return str(modelfile_path)

In [None]:
# Example Modelfiles for different use cases

# 1. Code Assistant
CODE_ASSISTANT_SYSTEM = """You are an expert programming assistant specialized in Python, 
JavaScript, and system design. You write clean, efficient, well-documented code. 
You explain your reasoning step by step and consider edge cases."""

# 2. Customer Support Bot
SUPPORT_BOT_SYSTEM = """You are a friendly customer support agent for TechCorp. 
You help users troubleshoot issues, answer product questions, and escalate 
complex issues when needed. Always be polite and helpful."""

# 3. Technical Writer
TECH_WRITER_SYSTEM = """You are a technical documentation specialist. 
You write clear, concise documentation with examples. 
You structure content logically and use consistent formatting."""

# Llama 3 chat template (only for Llama 3 family bases; other models, such as TinyLlama, need their own format)
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
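
To see what the model actually receives, the sketch below fills the Llama 3 template by plain string replacement, using the template text from the cell above cut off at the point where the model starts generating. This is only an approximation for illustration: Ollama renders templates with Go's `text/template`, and `render_template` is a hypothetical helper, not part of any Ollama API.

```python
# Llama 3 template from above, truncated where generation begins.
TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

def render_template(template: str, system: str, prompt: str) -> str:
    """Substitute the two placeholders, mimicking what a template engine does."""
    return template.replace("{{ .System }}", system).replace("{{ .Prompt }}", prompt)

rendered = render_template(TEMPLATE, "You are concise.", "What is GGUF?")
print(rendered)
```

If the special tokens in the rendered prompt do not match what the base model saw during fine-tuning, output quality usually degrades, which is the "wrong template" case in the troubleshooting section.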

In [None]:
# Create a Modelfile (example)
# Uncomment after converting to GGUF

# modelfile_path = create_modelfile(
#     gguf_path="./gguf-output/model-q4_k_m.gguf",
#     model_name="my-code-assistant",
#     system_prompt=CODE_ASSISTANT_SYSTEM,
#     template=LLAMA3_TEMPLATE,
#     parameters={
#         "temperature": 0.7,
#         "top_p": 0.95,
#         "num_ctx": 8192
#     }
# )

---

## Section 6: Deploy with Ollama

### Two Ways to Deploy

In [None]:
class OllamaDeployer:
    """
    Deploy and manage custom models with Ollama.
    """
    
    def __init__(self, host: str = "http://localhost:11434"):
        self.host = host
        self.api_url = f"{host}/api"
    
    def check_ollama_running(self) -> bool:
        """Check if Ollama server is running."""
        try:
            response = requests.get(f"{self.host}/api/tags", timeout=5)
            return response.status_code == 200
        except requests.exceptions.RequestException:
            # Covers refused connections as well as timeouts
            return False
    
    def create_model(
        self, 
        model_name: str, 
        modelfile_path: str
    ) -> bool:
        """
        Create a new model from a Modelfile.
        
        Equivalent to: ollama create <name> -f <Modelfile>
        """
        if not self.check_ollama_running():
            print("❌ Ollama is not running. Start with: ollama serve")
            return False
        
        print(f"🏗️ Creating model: {model_name}")
        
        # Shell out to the CLI: it resolves the local GGUF path in FROM and sends
        # the file to the server. The REST /api/create payload has changed between
        # Ollama releases, so the CLI is the more stable route.
        result = subprocess.run(
            ["ollama", "create", model_name, "-f", modelfile_path],
            capture_output=True,
            text=True,
            env={**os.environ, "OLLAMA_HOST": self.host}
        )
        
        if result.returncode != 0:
            print(f"❌ Model creation failed: {result.stderr.strip()}")
            return False
        
        print(f"✅ Model created: {model_name}")
        return True
    
    def list_models(self) -> list:
        """List all available models."""
        response = requests.get(f"{self.api_url}/tags")
        return response.json().get("models", [])
    
    def delete_model(self, model_name: str) -> bool:
        """Delete a model."""
        response = requests.delete(
            f"{self.api_url}/delete",
            json={"name": model_name}
        )
        return response.status_code == 200
    
    def model_info(self, model_name: str) -> dict:
        """Get model information."""
        response = requests.post(
            f"{self.api_url}/show",
            json={"name": model_name}
        )
        return response.json()

In [None]:
# Deploy your model
deployer = OllamaDeployer()

# Check if Ollama is running
if deployer.check_ollama_running():
    print("✅ Ollama is running")
    
    # List current models
    models = deployer.list_models()
    print(f"\n📋 Available models ({len(models)}):")
    for model in models:
        size_gb = model.get('size', 0) / (1024**3)
        print(f"   • {model['name']}: {size_gb:.2f}GB")
else:
    print("⚠️ Ollama not running. Start with: ollama serve")

In [None]:
# Create your custom model (uncomment after creating Modelfile)

# deployer.create_model(
#     model_name="my-code-assistant",
#     modelfile_path="./Modelfile"
# )

### Command Line Alternative

```bash
# Create model from Modelfile
ollama create my-code-assistant -f Modelfile

# Run the model
ollama run my-code-assistant

# List models
ollama list

# Show model info
ollama show my-code-assistant

# Delete model
ollama rm my-code-assistant
```

---

## Section 7: Using the Ollama API

Ollama provides a native REST API with `/api/generate` and `/api/chat` endpoints. Chat messages use the same role/content convention as OpenAI's API.

In [None]:
class OllamaClient:
    """
    Simple client for Ollama API.
    
    Uses Ollama's native endpoints; the chat message format mirrors OpenAI's.
    """
    
    def __init__(self, host: str = "http://localhost:11434"):
        self.host = host
    
    def generate(
        self,
        model: str,
        prompt: str,
        system: Optional[str] = None,
        stream: bool = False,
        **kwargs
    ) -> str:
        """
        Generate a completion.
        
        Args:
            model: Model name
            prompt: User prompt
            system: Optional system prompt override
            stream: Whether to stream response
            **kwargs: Extra request fields, e.g. options={"temperature": 0.1, "top_p": 0.9}
        """
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream,
            **kwargs
        }
        
        if system:
            payload["system"] = system
        
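        response = requests.post(
            f"{self.host}/api/generate",
            json=payload,
            stream=stream
        )
        response.raise_for_status()
        
        if stream:
            # Streaming returns one JSON object per line; join the text pieces
            return "".join(
                json.loads(line).get("response", "")
                for line in response.iter_lines() if line
            )
        return response.json()["response"]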
    
    def chat(
        self,
        model: str,
        messages: list,
        stream: bool = False,
        **kwargs
    ) -> str:
        """
        Chat completion (OpenAI-style message list).
        
        Args:
            model: Model name
            messages: List of {"role": "user/assistant/system", "content": "..."}
            stream: Whether to stream response
        """
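        response = requests.post(
            f"{self.host}/api/chat",
            json={
                "model": model,
                "messages": messages,
                "stream": stream,
                **kwargs
            },
            stream=stream
        )
        response.raise_for_status()
        
        if stream:
            # Streaming returns one JSON object per line; join the content pieces
            return "".join(
                json.loads(line).get("message", {}).get("content", "")
                for line in response.iter_lines() if line
            )
        return response.json()["message"]["content"]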
    
    def chat_stream(
        self,
        model: str,
        messages: list,
        **kwargs
    ):
        """
        Streaming chat completion.
        
        Yields tokens as they're generated.
        """
        response = requests.post(
            f"{self.host}/api/chat",
            json={
                "model": model,
                "messages": messages,
                "stream": True,
                **kwargs
            },
            stream=True
        )
        
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                if "message" in chunk:
                    yield chunk["message"].get("content", "")

In [None]:
# Test with the first installed model (pull llama3.2:1b if you have none)
client = OllamaClient()

# Check if we have a model to test with
deployer = OllamaDeployer()
if deployer.check_ollama_running():
    models = deployer.list_models()
    if models:
        test_model = models[0]["name"]  # keep the tag: "llama3.2:1b" without it resolves to ":latest"
        print(f"Testing with model: {test_model}")
        
        # Simple generation
        response = client.generate(
            model=test_model,
            prompt="What is 2+2? Reply with just the number.",
            options={"temperature": 0.1}
        )
        print(f"\nResponse: {response}")
    else:
        print("No models available. Run: ollama pull llama3.2:1b")
else:
    print("⚠️ Ollama not running")

In [None]:
# Chat example with streaming
if deployer.check_ollama_running() and models:
    print("Streaming response:")
    print("-" * 40)
    
    for token in client.chat_stream(
        model=test_model,
        messages=[
            {"role": "user", "content": "Write a haiku about machine learning."}
        ],
        options={"temperature": 0.7}
    ):
        print(token, end="", flush=True)
    
    print("\n" + "-" * 40)
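
Under the hood, a streaming response is newline-delimited JSON: one object per line, each carrying a fragment of the reply, with `done` set to true on the last one. The sketch below parses a hand-written sample in that shape, so it runs without a server. The sample lines are made up and omit most of the fields a real response includes, and `collect_stream` is a hypothetical helper.

```python
import json

# Hand-written sample in the shape of a streamed /api/chat response.
SAMPLE_STREAM = [
    b'{"message": {"role": "assistant", "content": "Weights"}, "done": false}',
    b'{"message": {"role": "assistant", "content": " shift"}, "done": false}',
    b'',  # empty lines are skipped, as in chat_stream above
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]

def collect_stream(lines) -> str:
    """Join the content fragments of a streamed chat response."""
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

text = collect_stream(SAMPLE_STREAM)
print(text)
```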

---

## Section 8: OpenAI-Compatible API

Ollama also exposes an OpenAI-compatible endpoint under `/v1`, so it can stand in for the OpenAI API in code that uses chat completions.

In [None]:
# DGX SPARK NOTE: Install openai package inside NGC container if not already available
# The openai package is a pure Python package and works fine on ARM64
import importlib
try:
    importlib.import_module('openai')
    print("✅ openai package is available")
except ImportError:
    print("Installing openai package...")
    import subprocess
    import sys
    # Install into the interpreter this notebook is running on
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "openai"], check=True)
    print("✅ openai package installed")

In [None]:
from openai import OpenAI

# Point OpenAI client to Ollama
openai_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Any string works, Ollama doesn't check
)

if deployer.check_ollama_running() and models:
    # Use OpenAI API format with Ollama
    response = openai_client.chat.completions.create(
        model=test_model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What's the capital of France?"}
        ],
        temperature=0.7
    )
    
    print("OpenAI-compatible response:")
    print(response.choices[0].message.content)
else:
    print("⚠️ Start Ollama first: ollama serve")

### Why This Matters

```python
# Your existing code:
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# Now runs locally by changing how the client is constructed:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

# The rest of your code stays the same, apart from passing your local model name as `model`.
```
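
For context, this is roughly what the OpenAI client sends once `base_url` points at Ollama: a POST to `/v1/chat/completions` with a bearer token and a JSON body. The sketch builds such a request with the standard library and does not send it. `build_chat_request` is a hypothetical helper, and `my-code-assistant` is the example model name used earlier.

```python
import json
import urllib.request

def build_chat_request(model, messages,
                       base_url="http://localhost:11434/v1", api_key="ollama"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("my-code-assistant", [{"role": "user", "content": "Hello"}])
print(req.full_url)

# To actually send it (requires a running Ollama server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```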

---

## Section 9: Complete Deployment Pipeline

Let's put it all together in one function.

In [None]:
def deploy_finetuned_model(
    base_model_id: str,
    adapter_path: str,
    ollama_model_name: str,
    system_prompt: str,
    quantization: str = "q4_k_m",
    work_dir: str = "./deployment",
    cleanup: bool = True
) -> bool:
    """
    Complete pipeline: LoRA adapter → Running Ollama model.
    
    Args:
        base_model_id: HuggingFace model ID
        adapter_path: Path to LoRA adapter
        ollama_model_name: Name for the Ollama model
        system_prompt: Default system prompt
        quantization: GGUF quantization level
        work_dir: Working directory for intermediate files
        cleanup: Whether to clean up intermediate files
    
    Returns:
        True if successful
    """
    work_path = Path(work_dir)
    work_path.mkdir(parents=True, exist_ok=True)
    
    print("="*60)
    print("🚀 DEPLOYMENT PIPELINE")
    print("="*60)
    
    # Step 1: Merge LoRA
    print("\n📍 Step 1/4: Merging LoRA adapter...")
    merged_path = work_path / "merged"
    merge_lora_model(
        base_model_id=base_model_id,
        adapter_path=adapter_path,
        output_path=str(merged_path)
    )
    
    # Step 2: Setup llama.cpp
    print("\n📍 Step 2/4: Setting up conversion tools...")
    llama_cpp_path = setup_llama_cpp()
    
    # Step 3: Convert to GGUF
    print("\n📍 Step 3/4: Converting to GGUF...")
    gguf_dir = work_path / "gguf"
    gguf_path = convert_to_gguf(
        model_path=str(merged_path),
        output_path=str(gguf_dir),
        quantization=quantization,
        llama_cpp_path=str(llama_cpp_path)
    )
    
    # Step 4: Create Modelfile and deploy
    print("\n📍 Step 4/4: Creating Ollama model...")
    modelfile_path = create_modelfile(
        gguf_path=gguf_path,
        model_name=ollama_model_name,
        system_prompt=system_prompt,
        output_dir=str(work_path)
    )
    
    deployer = OllamaDeployer()
    success = deployer.create_model(
        model_name=ollama_model_name,
        modelfile_path=modelfile_path
    )
    
    # Cleanup
    if cleanup and success:
        print("\n🧹 Cleaning up intermediate files...")
        import shutil
        shutil.rmtree(merged_path)
        # Keep GGUF for backup
    
    print("\n" + "="*60)
    if success:
        print("✅ SUCCESS! Run your model with:")
        print(f"   ollama run {ollama_model_name}")
    else:
        print("❌ Deployment failed. Check error messages above.")
    print("="*60)
    
    return success

In [None]:
# Full deployment example
# Uncomment with your actual paths

# deploy_finetuned_model(
#     base_model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
#     adapter_path="./my-lora-adapter",
#     ollama_model_name="my-assistant",
#     system_prompt="You are a helpful AI assistant specialized in...",
#     quantization="q4_k_m"
# )
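
Before running the full pipeline it can help to check disk space, because at its peak the work directory holds the merged FP16 checkpoint, the FP16 GGUF, and the quantized GGUF at once. The sketch below is a rough estimate under that assumption (2 bytes per parameter for each FP16 copy, nominal bits for the quantized file). `peak_disk_gb` is a hypothetical helper, and 1.1e9 parameters stands in for a TinyLlama-sized model.

```python
import shutil

def peak_disk_gb(n_params: float, quant_bits: float = 4) -> float:
    """Approximate peak disk use in decimal GB during merge + convert + quantize."""
    fp16_copy = n_params * 2                 # bytes for one FP16 copy
    quantized = n_params * quant_bits / 8    # bytes for the quantized GGUF
    return (2 * fp16_copy + quantized) / 1e9

needed = peak_disk_gb(1.1e9)                 # TinyLlama-sized example
free = shutil.disk_usage(".").free / 1e9
print(f"Need about {needed:.1f} GB, {free:.1f} GB free")
```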

---

## Section 10: Alternative - GGUF LoRA (Advanced)

Instead of merging, you can apply LoRA to a GGUF base model directly.

In [None]:
def create_gguf_lora_modelfile(
    base_model: str,  # Ollama model name or GGUF path
    adapter_gguf: str,  # LoRA in GGUF format
    model_name: str,
    system_prompt: str,
    output_dir: str = "."
) -> str:
    """
    Create Modelfile that applies LoRA at runtime.
    
    This avoids merging - the LoRA is applied on the fly.
    Useful when you have multiple LoRAs for the same base.
    """
    lines = [
        f"FROM {base_model}",
        f"ADAPTER {adapter_gguf}",
        "",
        f'SYSTEM """{system_prompt}"""',
        "",
        "PARAMETER temperature 0.7",
        "PARAMETER top_p 0.9"
    ]
    
    modelfile_path = Path(output_dir) / "Modelfile"
    modelfile_path.write_text("\n".join(lines))
    
    print("📝 Created LoRA Modelfile:")
    print("\n".join(lines))
    
    return str(modelfile_path)

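# This requires converting the LoRA adapter to GGUF format with llama.cpp, for example
# (flags can differ between llama.cpp versions; check the script's --help):
# python llama.cpp/convert_lora_to_gguf.py ./adapter --base ./base-model --outfile adapter.gguf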

### When to Use GGUF LoRA vs Merged

| Approach | Pros | Cons | Best For |
|----------|------|------|----------|
| **Merged** | Single file, simpler | Larger file, can't swap | Production deployment |
| **GGUF LoRA** | Swappable, smaller files | Slightly slower, complex | Multiple adapters |

---

## Section 11: Production Tips

### 1. Model Registry Organization

```bash
# Name your models clearly
ollama create company/model-v1.0-code -f Modelfile.code
ollama create company/model-v1.0-support -f Modelfile.support
ollama create company/model-v1.1-code -f Modelfile.code.v2
```

### 2. Running as a Service

```bash
# Systemd service (Linux)
sudo systemctl enable ollama
sudo systemctl start ollama

# Docker
docker run -d -p 11434:11434 -v ollama:/root/.ollama ollama/ollama
```

### 3. Performance Tuning

In [None]:
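# Ollama environment variables for DGX Spark
OLLAMA_CONFIG = """
# These must be set in the environment of the Ollama *server* process.
# For `ollama serve` started from a shell, add them to ~/.bashrc or ~/.zshrc;
# for the systemd service, set them with `sudo systemctl edit ollama`.

# GPU memory to hold back from Ollama, in bytes (here about 500 MB)
export OLLAMA_GPU_OVERHEAD=524288000

# Number of parallel requests
export OLLAMA_NUM_PARALLEL=4

# Keep models loaded longer
export OLLAMA_KEEP_ALIVE="30m"

# Maximum loaded models
export OLLAMA_MAX_LOADED_MODELS=2

# Default context window (recent Ollama versions; otherwise set num_ctx per model or request)
export OLLAMA_CONTEXT_LENGTH=8192
"""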

print(OLLAMA_CONFIG)
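
Context length and parallelism both cost memory, mostly through the KV cache. The sketch below estimates that cost under stated assumptions: an FP16 cache and the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128). It also assumes each parallel slot gets its own full context, which is how Ollama's FAQ describes parallel requests. `kv_cache_gib` is a hypothetical helper; treat the result as an order-of-magnitude guide.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, num_ctx,
                 num_parallel=1, bytes_per_value=2):
    """KV cache size in GiB: keys and values for every layer, KV head, and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * num_ctx * num_parallel / 1024**3

single = kv_cache_gib(32, 8, 128, num_ctx=8192)
parallel = kv_cache_gib(32, 8, 128, num_ctx=8192, num_parallel=4)
print(f"1 slot: {single:.1f} GiB, 4 slots: {parallel:.1f} GiB")
```

Under these assumptions, raising parallelism from 1 to 4 at an 8192-token context adds about 3 GiB on top of the model weights.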

### 4. Sharing with Your Team

```bash
# Push to Ollama registry (requires account)
ollama push username/my-model:latest

# Team member pulls it
ollama pull username/my-model:latest
ollama run username/my-model
```

### 5. Monitoring

In [None]:
def monitor_ollama():
    """Monitor Ollama server status and loaded models."""
    deployer = OllamaDeployer()
    
    if not deployer.check_ollama_running():
        print("❌ Ollama not running")
        return
    
    # Get running models
    response = requests.get("http://localhost:11434/api/ps")
    running = response.json().get("models", [])
    
    print("📊 Ollama Status")
    print("=" * 40)
    print(f"Running models: {len(running)}")
    
    for model in running:
        name = model.get("name", "unknown")
        size = model.get("size", 0) / (1024**3)
        expires = model.get("expires_at", "N/A")
        print(f"  • {name}: {size:.2f}GB (expires: {expires})")

if deployer.check_ollama_running():
    monitor_ollama()

---

## Section 12: Troubleshooting Guide

### Common Issues and Solutions

| Issue | Cause | Solution |
|-------|-------|----------|
| "model not found" | Model not created | `ollama create name -f Modelfile` |
| Connection refused | Ollama not running | `ollama serve` |
| Out of memory | Model too large | Use smaller quantization (Q4 vs Q8) |
| Slow generation | Not using GPU | Check `ollama ps` shows GPU usage |
| Wrong template | Mismatched chat format | Update TEMPLATE in Modelfile |
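| Browser shows only "Ollama is running" | Port 11434 is the API, not a web UI | Use `ollama run <model>` or a separate front end such as Open WebUI |

**Interactive testing**: `ollama serve` exposes the API at http://localhost:11434; opening that address in a browser only confirms the server is up. For interactive testing, use `ollama run <model>` in a terminal or connect a separate web front end such as Open WebUI.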

In [None]:
def diagnose_ollama():
    """Diagnose common Ollama issues."""
    print("🔍 Ollama Diagnostics")
    print("=" * 40)
    
    # Check if installed
    result = subprocess.run(["which", "ollama"], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"✅ Ollama installed: {result.stdout.strip()}")
    else:
        print("❌ Ollama not installed")
        print("   Install: curl -fsSL https://ollama.ai/install.sh | sh")
        return
    
    # Check if running
    deployer = OllamaDeployer()
    if deployer.check_ollama_running():
        print("✅ Ollama server running")
    else:
        print("❌ Ollama server not running")
        print("   Start: ollama serve")
        return
    
    # Check models
    models = deployer.list_models()
    print(f"✅ Models available: {len(models)}")
    
    # Check GPU
    if torch.cuda.is_available():
        print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
        print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f}GB")
    else:
        print("⚠️ No CUDA GPU (CPU inference only)")

diagnose_ollama()

---

## Key Takeaways

### 1. The Deployment Pipeline
```
LoRA Adapter → Merge → GGUF → Modelfile → Ollama Model
```

### 2. Essential Commands
```bash
ollama create mymodel -f Modelfile  # Create
ollama run mymodel                   # Run
ollama list                          # List
ollama rm mymodel                    # Delete
```

### 3. OpenAI Compatibility
```python
# Just change base_url!
client = OpenAI(base_url="http://localhost:11434/v1", api_key="x")
```

### 4. Quantization Guide
- **Q4_K_M**: Best balance for most uses
- **Q8_0**: When quality matters more than size
- **Q2_K**: When you're really tight on memory

---

## Exercises

### Exercise 1: Deploy a Pre-trained Model
Create a custom Modelfile for `llama3.2:1b` with a specific system prompt for your use case.

### Exercise 2: Full Pipeline
Take a fine-tuned adapter from Lab 3.1.4 and deploy it to Ollama.

### Exercise 3: Build an Application
Create a simple chat application using the Ollama API with your custom model.

### Exercise 4: Compare Quantizations
Convert the same model to Q4, Q5, and Q8. Compare:
- File sizes
- Generation speed
- Output quality
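
For the speed comparison, the non-streaming `/api/generate` response includes timing fields: `eval_count` (tokens generated) and `eval_duration` (nanoseconds). A minimal sketch of turning those into tokens per second, run here on made-up numbers; `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response body."""
    duration_s = response["eval_duration"] / 1e9   # nanoseconds to seconds
    return response["eval_count"] / duration_s

# Made-up timing values in the shape of a real response.
sample_response = {"eval_count": 120, "eval_duration": 3_000_000_000}
speed = tokens_per_second(sample_response)
print(f"{speed:.1f} tokens/s")
```

Run the same prompt against each quantization several times and compare the averages, since the first request also includes model load time.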

---

## Congratulations! 🎉

You've completed the **LLM Fine-Tuning Module**!

### What You've Learned

| Lab | Topic | Key Skill |
|-----|-------|----------|
| 3.1.1 | LoRA Theory | Understand low-rank adaptation |
| 3.1.2 | DoRA | Weight decomposition for +3.7 points |
| 3.1.3 | NEFTune | 5-line trick for +29% quality |
| 3.1.4 | 8B Fine-tuning | Complete training pipeline |
| 3.1.5 | 70B QLoRA | Fine-tune massive models locally |
| 3.1.6 | Dataset Prep | Format and clean training data |
| 3.1.7 | DPO | Align models with preferences |
| 3.1.8 | SimPO/ORPO | Modern preference optimization |
| 3.1.9 | KTO | Train with binary feedback |
| 3.1.10 | Ollama | Deploy models locally |

### You Can Now

✅ Fine-tune any open-source LLM  
✅ Use the latest efficient techniques (DoRA, NEFTune)  
✅ Align models with human preferences (DPO, SimPO, ORPO, KTO)  
✅ Deploy your models for local inference  
✅ Run 70B models on your DGX Spark  

### Next Steps

- **Module 3.2**: Quantization - Make models even smaller
- **Module 3.3**: Deployment - Production serving at scale
- **Module 3.4**: Test-Time Compute - Inference optimization