## Welcome to the Second Lab - Week 1, Day 3

Today we will work with lots of models! This is a way to get comfortable with APIs.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submit a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a GitHub account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [1]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [13]:
# Always remember to do this!
load_dotenv(override=True)

True

In [14]:
# Print the key prefixes to help with any debugging

# openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
# deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
# groq_api_key = os.getenv('GROQ_API_KEY')
xai_api_key = os.getenv('XAI_API_KEY')

# if openai_api_key:
#     print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
# else:
#     print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

# if deepseek_api_key:
#     print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
# else:
#     print("DeepSeek API Key not set (and this is optional)")

# if groq_api_key:
#     print(f"Groq API Key exists and begins {groq_api_key[:4]}")
# else:
#     print("Groq API Key not set (and this is optional)")

if xai_api_key:
    print(f"xAI API Key exists and begins {xai_api_key[:3]}")
else:
    print("xAI API Key not set (and this is optional)")

Anthropic API Key exists and begins sk-ant-
Google API Key exists and begins AI
xAI API Key exists and begins xai
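
The repeated if/else blocks above all follow one pattern, so they can be folded into a small helper. This is only a sketch: `describe_key` is a hypothetical name that is not part of the lab, and the prefix lengths are the ones used in the cell above.

```python
import os

def describe_key(env_name, prefix_len, optional=True):
    """Return a one-line status message for an API key stored in an environment variable."""
    key = os.getenv(env_name)
    if key:
        # Only a short prefix is shown so the full secret never lands in notebook output
        return f"{env_name} exists and begins {key[:prefix_len]}"
    suffix = " (and this is optional)" if optional else ""
    return f"{env_name} not set{suffix}"

# Prefix lengths mirror the ones used in the cell above
for env_name, prefix_len in [("ANTHROPIC_API_KEY", 7), ("GOOGLE_API_KEY", 2), ("XAI_API_KEY", 3)]:
    print(describe_key(env_name, prefix_len))
```

Adding another provider then becomes one more tuple in the list rather than another four-line block.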


In [32]:
competitors = []
answers = []

In [None]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [6]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [None]:
# openai = OpenAI()
# response = openai.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=messages,
# )
# question = response.choices[0].message.content

# claude = Anthropic()
# response = claude.messages.create(model="claude-sonnet-4-5-20250929", messages=messages, max_tokens=1000)
# question = response.content[0].text
# print(question)


You discover that a highly advanced alien civilization has been observing Earth for centuries and has concluded that humanity's greatest achievement isn't a technological invention or artistic masterpiece, but rather a specific cognitive habit or way of thinking that emerged from our particular evolutionary and cultural development. What might this habit be, and why would it be more valuable to them than our tangible accomplishments?


In [33]:
question = "i have a gaming laptop and want to run the best ollama model as possible on its gpu with 16gb vram (rtx 3080 mobile ) what is the best model with what quantization"
messages = [{"role": "user", "content": question}]

In [34]:
# The API we know well

# model_name = "gpt-4o-mini"

# response = openai.chat.completions.create(model=model_name, messages=messages)
# answer = response.choices[0].message.content

# display(Markdown(answer))
# competitors.append(model_name)
# answers.append(answer)

In [35]:
# Anthropic has a slightly different API, and the max_tokens argument is required

model_name = "claude-sonnet-4-5-20250929"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

# Best Ollama Models for RTX 3080 Mobile (16GB VRAM)

With 16GB VRAM, you have several excellent options. Here are my recommendations:

## Top Choices:

### **1. Llama 3.1 70B Q2_K (Best Overall)**
```bash
ollama run llama3.1:70b-instruct-q2_K
```
- **VRAM usage:** ~14-15GB
- **Quality:** Excellent reasoning despite Q2 quantization
- **Why:** The larger parameter count (70B) often outperforms smaller models even at lower quantization

### **2. Qwen 2.5 32B Q4_K_M (Balanced)**
```bash
ollama run qwen2.5:32b-instruct-q4_K_M
```
- **VRAM usage:** ~13-14GB
- **Quality:** Outstanding performance for coding and reasoning
- **Why:** Better quantization (Q4) with very capable 32B model

### **3. Llama 3.1 70B Q3_K_M (If you want higher quality)**
```bash
ollama run llama3.1:70b-instruct-q3_K_M
```
- **VRAM usage:** ~15.5-16GB (tight fit)
- **Quality:** Noticeably better than Q2
- **Why:** Pushes your VRAM limit but worth it

### **4. Mistral Small 22B Q5_K_M (Speed + Quality)**
```bash
ollama run mistral-small:22b-instruct-2409-q5_K_M
```
- **VRAM usage:** ~11-12GB
- **Quality:** Excellent quality with high quantization
- **Why:** Fast inference, great quality, headroom for other apps

## My Recommendation:
Start with **Qwen 2.5 32B Q4_K_M** - it offers the best balance of quality, speed, and VRAM efficiency for most tasks, especially coding.

## Quick Tips:
- Close other applications to maximize available VRAM
- Monitor temps on mobile GPU (they can throttle)
- Use `ollama ps` to check actual VRAM usage

What will you primarily use the model for? I can refine the recommendation based on your use case.
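
The Anthropic cell reads the reply from `response.content[0].text`, while the OpenAI-style cells read it from `response.choices[0].message.content`. The sketch below illustrates that difference with stand-in objects. `extract_text` is a hypothetical helper, and the `SimpleNamespace` objects only mimic the two shapes; they are not real SDK classes.

```python
from types import SimpleNamespace

def extract_text(response):
    """Pull the reply text out of an OpenAI-style or Anthropic-style response object."""
    if hasattr(response, "choices"):
        # OpenAI-compatible shape: response.choices[0].message.content
        return response.choices[0].message.content
    # Anthropic shape: response.content[0].text
    return response.content[0].text

# Stand-in objects (hypothetical, not real SDK classes) that mimic the two shapes
openai_like = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="hello"))])
anthropic_like = SimpleNamespace(content=[SimpleNamespace(text="hello")])

print(extract_text(openai_like), extract_text(anthropic_like))
```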

In [36]:
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.5-flash"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

That's a fantastic setup for local AI! Your **RTX 3080 mobile with 16GB VRAM** is a sweet spot, allowing you to run very capable models entirely on your GPU.

The "best" model is always a balance of **quality, speed, and VRAM usage**. With 16GB, you have a good amount to play with.

Here's what I recommend, along with the ideal quantizations:

---

### Understanding Quantization (`Q_K_M`):

*   **Quantization (Q)** reduces the precision of the model's weights, making it smaller and faster, but with a potential slight loss in quality.
*   **`_K_M` (e.g., Q5_K_M)**: This is the modern, optimized quantization scheme for GGUF (the format Ollama uses). It offers a much better balance of speed and quality compared to older quantizations.
*   **Higher number = Less compression = Better quality / More VRAM / Slower.**
*   **Lower number = More compression = Lower quality / Less VRAM / Faster.**

**General Recommendation for your 16GB VRAM:**
You'll generally want to aim for **Q5_K_M** or **Q6_K** for 7B/8B models, and **Q4_K_M** or **Q5_K_M** for 13B models.

---

### Top Model Recommendations for your 16GB VRAM:

1.  **Llama 3 8B (Recommended - Best Overall Balance)**
    *   **Why:** Llama 3 is Meta's latest and greatest. The 8B version is incredibly capable for its size, offering excellent reasoning, instruction following, and general knowledge. It's often considered state-of-the-art for its parameter count.
    *   **Quantization:**
        *   **`llama3:8b-instruct-q6_K` (around 5.8GB VRAM)**: Excellent balance of quality and speed. This is probably the sweet spot.
        *   **`llama3:8b-instruct-q5_K_M` (around 5.1GB VRAM)**: Slightly smaller, very good quality, and slightly faster.
        *   **`llama3:8b-instruct-q8_0` (around 8.2GB VRAM)**: You can even run this for maximum quality on the 8B model, though the difference from q6_K is often minor.
    *   **How to run:** `ollama run llama3:8b-instruct-q6_K`

2.  **Mistral 7B (Excellent Alternative - Fast & Creative)**
    *   **Why:** Mistral 7B is known for its speed, creativity, and good performance on a wide range of tasks. It's a fantastic all-rounder and a bit faster than Llama 3 8B in some cases. Many popular fine-tunes are based on Mistral.
    *   **Quantization:**
        *   **`mistral:7b-instruct-v0.2-q6_K` (around 5.0GB VRAM)**: Great balance.
        *   **`mistral:7b-instruct-v0.2-q5_K_M` (around 4.4GB VRAM)**: Very fast and still high quality.
    *   **How to run:** `ollama run mistral:7b-instruct-v0.2-q6_K`

3.  **OpenHermes 2.5 (Fine-tuned for Chat/Instruction Following)**
    *   **Why:** This is a fine-tune of Mistral 7B, specifically optimized for chat and instruction following. It's often praised for its conversational abilities and willingness to follow complex prompts.
    *   **Quantization:**
        *   **`openhermes2.5-mistral:7b-q6_K` (around 5.0GB VRAM)**: Ideal for quality.
        *   **`openhermes2.5-mistral:7b-q5_K_M` (around 4.4GB VRAM)**: Still very good and zippy.
    *   **How to run:** `ollama run openhermes2.5-mistral:7b-q6_K`

4.  **Llama 2 13B (For More "Brainpower" - Pushing VRAM)**
    *   **Why:** If you want a bit more raw capability than a 7B/8B model and don't mind a slight speed reduction, a 13B model can offer more complex reasoning. While Llama 3 8B often beats Llama 2 13B, there are still scenarios where the larger parameter count can shine.
    *   **Quantization:**
        *   **`llama2:13b-chat-q5_K_M` (around 8.5GB VRAM)**: This will fit comfortably and offer good quality.
        *   **`llama2:13b-chat-q4_K_M` (around 7.5GB VRAM)**: A good option if you want to save a little VRAM or prefer slightly faster generation, with minimal perceived quality loss.
    *   **How to run:** `ollama run llama2:13b-chat-q5_K_M`

---

### What to AVOID (for full GPU offload):

*   **Mixtral 8x7B (or any similar "expert" model):** While Mixtral is excellent, its "8x7B" architecture means it effectively uses weights equivalent to a much larger model (around 45-50B parameters) when all experts are considered. Even a Q4_K_M version of Mixtral is typically ~25-30GB, far exceeding your 16GB VRAM for full GPU offload. It will run very slowly with significant CPU offload.

---

### How to Get Started:

1.  **Install Ollama:** If you haven't already, download and install Ollama from [ollama.com](https://ollama.com/).
2.  **Download and Run:** Open your terminal or command prompt and use the `ollama run` command for the model and quantization you want. Ollama will automatically download it if you don't have it.
    *   Example: `ollama run llama3:8b-instruct-q6_K`
3.  **Monitor VRAM:** While running, open `NVIDIA-smi` in another terminal window to monitor your GPU's VRAM usage.

### Pro Tip: Experiment!

Download a couple of these and try them out. You might find you prefer the "personality" or specific strengths of one over another for your particular use cases. Start with Llama 3 8B Q6_K, as it's currently the go-to for many.

Enjoy your powerful local AI setup!

In [None]:
# deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
# model_name = "deepseek-chat"

# response = deepseek.chat.completions.create(model=model_name, messages=messages)
# answer = response.choices[0].message.content

# display(Markdown(answer))
# competitors.append(model_name)
# answers.append(answer)

In [None]:
# groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
# model_name = "llama-3.3-70b-versatile"

# response = groq.chat.completions.create(model=model_name, messages=messages)
# answer = response.choices[0].message.content

# display(Markdown(answer))
# competitors.append(model_name)
# answers.append(answer)


In [37]:
# xAI Grok - uses OpenAI-compatible API
xai = OpenAI(api_key=xai_api_key, base_url="https://api.x.ai/v1")
model_name = "grok-4"

response = xai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Based on your setup (gaming laptop with RTX 3080 mobile and 16GB VRAM), I'll recommend the best Ollama model in terms of capability (i.e., intelligence, reasoning, and output quality) that you can run effectively on your GPU. Ollama uses CUDA for NVIDIA GPU acceleration, and it supports automatic layer offloading—if the model doesn't fully fit in VRAM, it loads as many layers as possible onto the GPU and falls back to CPU/RAM for the rest. This still gives a big performance boost over pure CPU.

### Key Considerations for Your Hardware
- **VRAM Limits**: 16GB is solid, but large models (e.g., 70B+ parameters) won't fit entirely in VRAM even when quantized (compressed). Expect ~10-40 layers on GPU depending on the model and quantization (with the rest on CPU). This can yield 5-15 tokens/second generation speed, which is usable for most tasks.
- **Quantization**: Lower quantization (e.g., Q4) reduces model size to fit more on GPU but slightly lowers quality. Higher quantization (e.g., Q8) preserves quality but requires more VRAM. I prioritize a balance of quality, size, and performance.
- **Other Factors**: Your laptop's CPU (assuming something like an Intel i7 or Ryzen 7) and at least 16GB system RAM will help with offloading. Keep context size reasonable (e.g., 4K-8K tokens) to avoid VRAM overflow. Use Ollama's latest version for optimal GPU support.
- **Testing**: Results can vary by exact laptop model (cooling, power limits). Start with `ollama run <model>` and monitor VRAM usage with tools like `nvidia-smi`.

### Recommended Model: Llama 3.1 70B with Q4_K_M Quantization
- **Why this model?** Llama 3.1 70B is one of the most capable open-source models available in Ollama right now—excellent for complex reasoning, coding, writing, and general tasks. It's significantly smarter than smaller models like 7B/8B or 13B variants, even if it runs slower due to partial offloading. For 16GB VRAM, this is the "best" (most advanced) you can realistically run without constant swapping or crashes. Alternatives like Mixtral 8x22B or Llama 3.1 405B are too big even with heavy quantization.
- **Quantization: Q4_K_M**
  - **Why Q4_K_M?** It's a great balance: model size is ~40GB (fits ~25-35 layers on your 16GB GPU, depending on context). Quality is high (close to unquantized) with minimal loss compared to Q3 or Q2. Avoid Q5 or Q6 for 70B—they're ~45-50GB and offload more to CPU, slowing things down. Q4_0 is similar but slightly lower quality than Q4_K_M.
  - **Performance Expectation**: 5-10 tokens/second on generation (faster for short responses). Full GPU usage for the loaded layers; the rest uses CPU. If you want faster speeds at the cost of capability, drop to Llama 3.1 8B (see below).
- **How to Run It**:
  1. Install Ollama (download from ollama.com).
  2. Pull and run: `ollama run llama3.1:70b-instruct-q4_K_M` (this pulls a pre-quantized version; check Ollama's library for exact tags if needed, as they may vary).
  3. Specify GPU layers if desired: Edit the Modelfile or use flags like `--gpu-layers 35` (experiment to max out your VRAM without OOM errors).
  4. Test with a prompt: Something like "Explain quantum computing in simple terms."

If this feels too slow or you prioritize speed over raw capability (e.g., for chatbots or quick queries), go with a smaller model that fits fully in VRAM:

### Faster Alternative: Llama 3.1 8B with Q8_0 Quantization
- **Why?** This is still very capable (great for most everyday tasks) but runs blazing fast since it fits entirely in ~9GB VRAM. Speed: 40-60+ tokens/second.
- **How to Run**: `ollama run llama3.1:8b-instruct-q8_0`.
- **When to Choose This**: If 70B's speed is a dealbreaker, or for low-latency use cases.

### Other Tips
- **Upgrades for Better Performance**: If you can add more system RAM (e.g., 32GB+), it helps with offloading. For even larger models, consider external tools like llama.cpp for finer control.
- **Model Library**: Browse `ollama list` or ollama.com/models for options. Variants like "instruct" are tuned for chat/following instructions.
- **Troubleshooting**: If you get VRAM errors, reduce context size or layers. Update NVIDIA drivers for best CUDA support.
- **Benchmarks**: Search Reddit (r/LocalLLaMA) or Hugging Face for user reports on RTX 3080 mobile with Ollama—experiences match what I've described.

If you provide more details (e.g., your CPU/RAM, intended use case, or speed vs. quality preference), I can refine this further!

## For the next cell, we will use Ollama

Ollama runs a local web service that exposes an OpenAI-compatible endpoint, and runs models locally using high-performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download and following the instructions.

After it's installed, you should be able to visit here: http://localhost:11434 and see the message "Ollama is running"

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads
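
The browser check described above (visiting http://localhost:11434 and seeing "Ollama is running") can also be done from Python. This is a minimal sketch using only the standard library. The function name `ollama_is_running` is an assumption for illustration, and it assumes the default port from the text.

```python
import urllib.request
import urllib.error

def ollama_is_running(url="http://localhost:11434", timeout=2.0):
    """Return True if the local Ollama server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Ollama is running" in body
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is probably not started
        return False

print("Ollama is running" if ollama_is_running() else "Ollama not reachable - try `ollama serve`")
```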

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [None]:
!ollama pull qwen2.5:32b-instruct-q3_K_M

In [38]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "qwen2.5:32b-instruct-q3_K_M"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Running the largest model available while ensuring it works optimally within your GPU's 16GB VRAM limit requires some consideration of both model size and quantization techniques. The combination of a large model that fits in memory without causing performance degradation due to excessive swapping, along with efficient quantization, is key.

For your RTX 3080 mobile GPU (with 16GB VRAM), several popular models could be considered:

### Large Models
- **llama2**: The LLaMA2 series includes models up to 70 billion parameters. However, the largest versions might exceed your VRAM capacity.
- **Pythia**: Anthropic's larger Pythia models (like the 12B model) are known to work well.

### Quantization
Quantization reduces the precision of the weights in the model, which can significantly reduce the memory footprint. Common quantization formats include:
- **4-bit**: Most efficient for memory but may compromise some performance.
- **8-bit**: Less efficient than 4-bit but typically provides better performance and is often more compatible with existing models.

### Suggestion
Given your constraints (16GB VRAM), a good balance between model size and quantization could be the following:

#### - **Pythia-12B with Quantization**:
Using an 8-bit quantized version of Pythia-12B might work well. With 12 billion parameters, it should fit comfortably into your GPU's memory when properly quantized.

Alternatively,

#### - **LLaMA (or LLaMA2) Models under 65 Billion Parameters with Quantization**:
An 8-bit or even a 4-bit quantized version of the smaller LLaMA2 models might also work well. You could start with an 8-bit quantized model and, if that does not fit as expected (considering overheads), try switching to a 4-bit version.

### Testing
It's essential to test these configurations on your laptop because various factors like other software background tasks can influence VRAM usage. The exact compatibility might vary based on the specific implementation of quantization used by your software stack.

Always ensure that any model you use complies with the licensing and distribution policies set by its creators or distributors.
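
Gemini, DeepSeek, Groq, xAI and Ollama are all reached above through the same `OpenAI` client class, with only `base_url` and `api_key` changing. One way to make that pattern explicit is a lookup table. The base URLs below are copied from the cells above, while the dict name and the `client_kwargs` helper are hypothetical and only illustrate the idea.

```python
# Base URLs copied from the cells above; the dict name and helper are illustrative only
OPENAI_COMPATIBLE_ENDPOINTS = {
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "deepseek": "https://api.deepseek.com/v1",
    "groq": "https://api.groq.com/openai/v1",
    "xai": "https://api.x.ai/v1",
    "ollama": "http://localhost:11434/v1",
}

def client_kwargs(provider, api_key):
    """Build the keyword arguments the notebook passes to OpenAI(...) for a provider."""
    return {"api_key": api_key, "base_url": OPENAI_COMPATIBLE_ENDPOINTS[provider]}

# With the SDK imported, usage would look like: OpenAI(**client_kwargs("ollama", "ollama"))
print(client_kwargs("ollama", "ollama"))
```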

In [39]:
# So where are we?

print(competitors)
print(answers)


['claude-sonnet-4-5-20250929', 'gemini-2.5-flash', 'grok-4', 'qwen2.5:32b-instruct-q3_K_M']
['# Best Ollama Models for RTX 3080 Mobile (16GB VRAM)\n\nWith 16GB VRAM, you have several excellent options. Here are my recommendations:\n\n## Top Choices:\n\n### **1. Llama 3.1 70B Q2_K (Best Overall)**\n```bash\nollama run llama3.1:70b-instruct-q2_K\n```\n- **VRAM usage:** ~14-15GB\n- **Quality:** Excellent reasoning despite Q2 quantization\n- **Why:** The larger parameter count (70B) often outperforms smaller models even at lower quantization\n\n### **2. Qwen 2.5 32B Q4_K_M (Balanced)**\n```bash\nollama run qwen2.5:32b-instruct-q4_K_M\n```\n- **VRAM usage:** ~13-14GB\n- **Quality:** Outstanding performance for coding and reasoning\n- **Why:** Better quantization (Q4) with very capable 32B model\n\n### **3. Llama 3.1 70B Q3_K_M (If you want higher quality)**\n```bash\nollama run llama3.1:70b-instruct-q3_K_M\n```\n- **VRAM usage:** ~15.5-16GB (tight fit)\n- **Quality:** Noticeably better than

In [40]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")


Competitor: claude-sonnet-4-5-20250929

# Best Ollama Models for RTX 3080 Mobile (16GB VRAM)

With 16GB VRAM, you have several excellent options. Here are my recommendations:

## Top Choices:

### **1. Llama 3.1 70B Q2_K (Best Overall)**
```bash
ollama run llama3.1:70b-instruct-q2_K
```
- **VRAM usage:** ~14-15GB
- **Quality:** Excellent reasoning despite Q2 quantization
- **Why:** The larger parameter count (70B) often outperforms smaller models even at lower quantization

### **2. Qwen 2.5 32B Q4_K_M (Balanced)**
```bash
ollama run qwen2.5:32b-instruct-q4_K_M
```
- **VRAM usage:** ~13-14GB
- **Quality:** Outstanding performance for coding and reasoning
- **Why:** Better quantization (Q4) with very capable 32B model

### **3. Llama 3.1 70B Q3_K_M (If you want higher quality)**
```bash
ollama run llama3.1:70b-instruct-q3_K_M
```
- **VRAM usage:** ~15.5-16GB (tight fit)
- **Quality:** Noticeably better than Q2
- **Why:** Pushes your VRAM limit but worth it

### **4. Mistral Small 22B Q5

In [41]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

In [42]:
print(together)

# Response from competitor 1

# Best Ollama Models for RTX 3080 Mobile (16GB VRAM)

With 16GB VRAM, you have several excellent options. Here are my recommendations:

## Top Choices:

### **1. Llama 3.1 70B Q2_K (Best Overall)**
```bash
ollama run llama3.1:70b-instruct-q2_K
```
- **VRAM usage:** ~14-15GB
- **Quality:** Excellent reasoning despite Q2 quantization
- **Why:** The larger parameter count (70B) often outperforms smaller models even at lower quantization

### **2. Qwen 2.5 32B Q4_K_M (Balanced)**
```bash
ollama run qwen2.5:32b-instruct-q4_K_M
```
- **VRAM usage:** ~13-14GB
- **Quality:** Outstanding performance for coding and reasoning
- **Why:** Better quantization (Q4) with very capable 32B model

### **3. Llama 3.1 70B Q3_K_M (If you want higher quality)**
```bash
ollama run llama3.1:70b-instruct-q3_K_M
```
- **VRAM usage:** ~15.5-16GB (tight fit)
- **Quality:** Noticeably better than Q2
- **Why:** Pushes your VRAM limit but worth it

### **4. Mistral Small 22B Q5_K_M (Spee
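
The same combined text can be built with `enumerate(..., start=1)` and `"".join(...)`, which avoids the `index+1` arithmetic and the repeated string concatenation. This is a sketch of an equivalent alternative on placeholder data, not a change to the cell above.

```python
sample_answers = ["Answer from the first model", "Answer from the second model"]  # placeholder data

# enumerate(..., start=1) yields 1-based numbers directly, so no index+1 is needed
sections = [
    f"# Response from competitor {number}\n\n{answer}\n\n"
    for number, answer in enumerate(sample_answers, start=1)
]
together_sketch = "".join(sections)
print(together_sketch)
```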

In [43]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""
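
Note the doubled braces in the prompt above. Inside an f-string, `{name}` interpolates a value, while `{{` and `}}` produce literal braces, which is what lets the JSON template appear in the prompt. A small sketch with a placeholder value:

```python
import json

count = 3  # placeholder value standing in for len(competitors)

# Single braces interpolate a value; doubled braces produce literal { and }
prompt = f"""There are {count} competitors.
Respond with JSON in this format:
{{"results": ["1", "2", "3"]}}"""

print(prompt)

# The last line of the prompt is itself valid JSON once the braces are un-doubled
example = json.loads(prompt.splitlines()[-1])
print(example["results"])
```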


In [53]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and identify the top 3 models recommended across all responses, explaining why you rank them so highly.
Also include the model qwen2.5:32b-instruct-q3_K_M in your considerations.
Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors."""


In [1]:
print(judge)

NameError: name 'judge' is not defined

In [54]:
judge_messages = [{"role": "user", "content": judge}]

In [55]:
# Judgement time!

# openai = OpenAI()
# response = openai.chat.completions.create(
#     model="o3-mini",
#     messages=judge_messages,
# )
# results = response.choices[0].message.content
# print(results)

model_name = "claude-sonnet-4-5-20250929"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=judge_messages, max_tokens=4000)
results = response.content[0].text



In [None]:
print(results) # ['claude-sonnet-4-5-20250929', 'gemini-2.5-flash', 'grok-4', 'qwen2.5:32b-instruct-q3_K_M']

```json
{
  "ranking": [
    {
      "rank": 1,
      "competitor": 1,
      "justification": "Competitor 1 provides the most comprehensive and practical response. They offer four well-reasoned options with specific commands, VRAM usage estimates, and clear trade-offs. The response directly addresses multiple use cases and includes the most relevant models for 16GB VRAM. The recommendation of Qwen 2.5 32B Q4_K_M as the starting point is excellent, and they also intelligently suggest Llama 3.1 70B at Q2_K and Q3_K_M quantizations - pushing the boundaries of what's possible with 16GB VRAM while maintaining usability."
    },
    {
      "rank": 2,
      "competitor": 3,
      "justification": "Competitor 3 delivers a technically detailed and well-reasoned response. They provide excellent context about layer offloading, realistic performance expectations (5-10 tokens/second), and practical considerations for laptop usage. The recommendation of Llama 3.1 70B Q4_K_M is solid, prioritizing c

In [57]:
# OK let's turn this into results!

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
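
The `JSONDecodeError` at character 0 is consistent with the judge's reply shown above, which starts with a markdown code fence rather than with `{`. A more tolerant parser could strip the fence before calling `json.loads`. The sketch below uses a made-up reply and placeholder competitor names, and `parse_judge_json` is a hypothetical helper.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to read

def parse_judge_json(text):
    """Parse JSON from a model reply, tolerating a surrounding markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the whole opening fence line (a fence optionally followed by "json")
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
    if cleaned.rstrip().endswith(FENCE):
        # Drop the closing fence
        cleaned = cleaned.rstrip()[: -len(FENCE)]
    return json.loads(cleaned)

fenced = FENCE + 'json\n{"results": ["2", "1", "3"]}\n' + FENCE  # made-up reply for illustration
sample_competitors = ["model-a", "model-b", "model-c"]           # placeholder names

ranks = parse_judge_json(fenced)["results"]
for index, result in enumerate(ranks):
    print(f"Rank {index+1}: {sample_competitors[int(result)-1]}")
```

Stripping the fence is only half of the fix here. The reply recorded above uses a `"ranking"` key holding objects, not the `"results"` list that the cell expects, because the second judge prompt no longer spells out the JSON format. Restoring the format line from the first prompt would be needed for the `results_dict["results"]` lookup to work.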

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which pattern(s) did this use? Try updating this to add another Agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">These kinds of patterns - to send a task to multiple models, and evaluate results,
            are common where you need to improve the quality of your LLM response. This approach can be universally applied
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>