# Demo 4 - Developing an LLM Router with `LLMPop` Using GPT5 for Vibe Coding
By: [Lior Gazit](https://github.com/LiorGazit).  
Repo: [Agents-Over-The-Weekend](https://github.com/PacktPublishing/Agents-Over-The-Weekend/tree/main/Lior_Gazit/workshop_september_2025/)  
Running LLMs locally for free: This code leverages [`LLMPop`](https://pypi.org/project/llmpop/), a library dedicated to spinning up local or remote LLMs with a unified, modular syntax.    

In this notebook we experiment with vibe coding using ChatGPT.  
We give GPT5 a single prompt plus reference material and ask it to design a smart LLM router that maps each user prompt to the most suitable LLM, weighing cost, speed, and depth.  
Reference material provided to GPT5 for building the code solution:    
1. A machine-readable guide to `LLMPop`, a library that unifies the call syntax across LLMs and is built to suit LLM routers.  
2. A table of LLMs that gives GPT5 context on the available models, around which to build the routing logic.  

<a target="_blank" href="https://colab.research.google.com/github/PacktPublishing/Agents-Over-The-Weekend/blob/main/Lior_Gazit/workshop_september_2025/codes_for_Lior_Bootcamp_talk_demo4.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a> (pick a GPU Colab session for fastest computing)  

```
Disclaimer: The content and ideas presented in this notebook are solely those of the author, Lior Gazit, and do not represent the views or intellectual property of the author's employer.
```

## Prompt for coder-LLM:


Write a Python script (Later to become a stand alone Jupyter notebook) that demonstrates a small-scale LLM router using the LLMPop library.

For information on LLMPop:  
Details on the LLMPop library are described in the attached document: LLM_READABLE_GUIDE.md  
Refer to LLM_READABLE_GUIDE.md to understand how to use the LLMPop library.  

For information on the different models available on Ollama:  
Refer to the file Ollama models by size and call name (a limited list).csv  
It has a limited list of models, and it gives some idea of which are lighter and which are heavier in size.  

### Requirements:

1. Purpose: The script routes user prompts to one of several available LLMs (some from Ollama for local inference, others from OpenAI’s API, per the LLMPop library's functionality) based on simple routing logic.
2. Inputs to Routing Function:
   * User prompt (text)
   * Optional conversation context (previous messages, user metadata, etc.)
   * Optional optimization criteria (e.g., prioritize cost, speed, or reasoning depth)
   * Optional runtime resource information (e.g., system load, API quotas)
3. Routing Logic: Implement a basic, illustrative strategy to decide which model to call.  
   For example:
   * Short/simple prompts → lightweight, fast model (local Ollama model).
   * Complex prompts (math/logic/code) → reasoning-heavy OpenAI GPT-4 model.
   * Requests flagged as “specialized” → a chosen domain-specific model.
   * Add dummy evaluation for resource optimization (simulate cost or latency).
4. Output: Show which model is selected, call it with a test prompt, and print a sample response.
5. Implementation Notes:
   * Import and use functions/classes from LLMPop (assume the library is installed and documentation is available).
   * Include a main() function that:
     * Accepts a sample user query,
     * Runs it through your routing logic,
     * Dispatches it to the chosen LLM.
   * Add inline comments explaining the decision-making process.
   * Keep it minimal, clean, and easy to adapt as a demo or template.

### Output Expectations:
* Full runnable Python code (later to be a notebook).
* Clear, commented code that demonstrates basic extensibility for future routing improvements.
* Use resource checks.

### Goal:
Deliver a concise, readable prototype demonstrating LLM routing patterns, not a production service.

## The preliminary code the bot wrote:  

```python

"""
Small-Scale LLM Router Demo
---------------------------
Purpose:
  Route a user prompt to a local (Ollama) or remote (OpenAI) model using basic,
  illustrative heuristics. Keep it readable and easy to adapt.

Prereqs:
  pip -q install llmpop
  pip install langchain_core psutil pandas

Notes:
  - Reads an optional CSV of Ollama models to choose a lightweight or heavier local model:
      /mnt/data/Ollama models by size and call name (a limited list).csv
  - Uses `init_llm` from LLMPop (returns a LangChain ChatModel).
  - Uses simple heuristics to choose a model: complexity, specialization, optimization,
    plus a quick CPU/memory snapshot and dummy cost/latency scoring.
  - Falls back gracefully if a model or provider isn't available.
"""

from __future__ import annotations
import os
import re
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# --- LLMPop imports (per your guide) ---
# Guide reference: init_llm() returns a ChatModel (ChatOllama / ChatOpenAI)
# and start_resource_monitoring() optionally logs CPU/Mem/GPU in a background thread.
# (This demo just shows you how to call them; monitoring is optional.)
from llmpop import init_llm, start_resource_monitoring

# --- Optional utilities ---
import psutil
import pandas as pd

# -----------------------
# Model catalog & metadata
# -----------------------

@dataclass
class ModelSpec:
    name: str               # provider-specific model id (e.g., 'llama3:8b' or 'gpt-4o-mini')
    provider: str           # 'ollama' or 'openai'
    size_rank: int          # smaller number = lighter/faster (heuristic)
    tags: List[str]         # ['general', 'code', 'math', 'reasoning', 'specialized:finance', ...]
    est_cost_per_1k: float  # dummy USD per 1k tokens (for routing illustration)
    est_latency: float      # dummy seconds base latency (for routing illustration)
    notes: str = ""

def _read_ollama_csv(csv_path: str) -> List[ModelSpec]:
    """
    Read the user's 'Ollama models by size and call name (a limited list).csv'
    and convert it into ModelSpec entries with rough size ranks.
    The CSV is expected to have columns like: 'call_name', 'size' (text), etc.
    This function is defensive and will skip rows missing these fields.
    """
    specs: List[ModelSpec] = []
    if not os.path.exists(csv_path):
        return specs

    try:
        df = pd.read_csv(csv_path)
    except Exception:
        return specs

    # Try to infer a size rank from a 'size' or 'parameters' column
    def infer_size_rank(size_str: str) -> int:
        if not isinstance(size_str, str):
            return 5
        s = size_str.lower()
        # simple heuristics: lower rank is "lighter"
        if any(k in s for k in ["2b", "3b", "tiny", "mini", "small"]): return 1
        if any(k in s for k in ["7b", "8b", "medium"]): return 2
        if any(k in s for k in ["13b", "14b", "large"]): return 3
        if any(k in s for k in ["30b", "34b", "xl"]): return 4
        return 5

    for _, row in df.iterrows():
        call_name = str(row.get("call_name") or row.get("name") or "").strip()
        size_str  = str(row.get("size") or row.get("parameters") or "").strip()
        if not call_name:
            continue
        specs.append(
            ModelSpec(
                name=call_name,
                provider="ollama",
                size_rank=infer_size_rank(size_str),
                tags=["general"],
                est_cost_per_1k=0.0,        # local inference → treat as $0 for demo
                est_latency=max(0.2, 0.05 * infer_size_rank(size_str)),
                notes=f"From CSV; size='{size_str}'"
            )
        )
    return specs

def _default_catalog() -> List[ModelSpec]:
    """
    Default demo catalog if CSV is unavailable. Adjust to your environment.
    """
    return [
        # Local (Ollama) — fast/light general chat
        ModelSpec(name="llama3.1:8b-instruct", provider="ollama", size_rank=2,
                  tags=["general"], est_cost_per_1k=0.0, est_latency=0.3,
                  notes="Local, lightweight general model"),
        # Local (Ollama) — code-aware (replace with your local favorite)
        ModelSpec(name="codellama:7b-instruct", provider="ollama", size_rank=2,
                  tags=["code"], est_cost_per_1k=0.0, est_latency=0.35,
                  notes="Local, code-oriented"),
        # Remote (OpenAI) — reasoning-heavy
        ModelSpec(name="gpt-4o", provider="openai", size_rank=4,
                  tags=["reasoning", "math", "complex"], est_cost_per_1k=5.0, est_latency=0.6,
                  notes="Remote, strong reasoning"),
        # Remote (OpenAI) — fast/cheap
        ModelSpec(name="gpt-4o-mini", provider="openai", size_rank=1,
                  tags=["general", "fast"], est_cost_per_1k=0.6, est_latency=0.25,
                  notes="Remote, budget/latency friendly"),
    ]

def build_catalog(csv_path: str) -> List[ModelSpec]:
    specs = _read_ollama_csv(csv_path)
    specs += _default_catalog()
    return specs

# -----------------------
# Lightweight heuristics
# -----------------------

def is_complex_prompt(text: str) -> bool:
    """
    Naive complexity signal: math markers, code fences, long length, or
    words indicating step-by-step reasoning.
    """
    if len(text) > 300:
        return True
    code_markers = ["```", "def ", "class ", "SELECT ", "FROM ", "import ", "public static void"]
    math_markers = [r"\bprove\b", r"\bderive\b", r"\bO\(", r"\bintegral\b", r"[0-9]+\s*[\+\-\*\/\^]\s*[0-9]+"]
    reasoning_terms = [r"\bchain of thought\b", r"\bstep-by-step\b", r"\bthink step by step\b"]

    if any(m in text for m in code_markers):
        return True
    if any(re.search(p, text, flags=re.IGNORECASE) for p in math_markers + reasoning_terms):
        return True
    return False

def detect_specialization(text: str) -> Optional[str]:
    """
    Return a specialization tag when obvious (very naive).
    """
    pairs = {
        "code": [r"\bwrite (a|the) (python|js|sql|java|c\+\+|c#)"],
        "sql":  [r"\bSELECT\b.*\bFROM\b", r"\bCREATE TABLE\b"],
        "finance": [r"\bDCF\b", r"\bWACC\b", r"\b10-K\b"],
        "biology": [r"\bRNA\b", r"\bCRISPR\b"],
        "law": [r"\bstatute\b", r"\btort\b"],
        "math": [r"\bprove\b", r"\bintegral\b", r"\bderivative\b"],
    }
    for tag, patterns in pairs.items():
        if any(re.search(p, text, flags=re.IGNORECASE | re.DOTALL) for p in patterns):
            return tag
    return None

def snapshot_resources() -> Dict[str, Any]:
    """
    Quick resource snapshot (CPU %, available RAM GB). These values can inform routing.
    """
    try:
        cpu = psutil.cpu_percent(interval=0.2)
        mem = psutil.virtual_memory()
        avail_gb = round(mem.available / (1024**3), 2)
        return {"cpu_percent": cpu, "available_gb": avail_gb}
    except Exception:
        return {"cpu_percent": None, "available_gb": None}

def score_candidate(
    spec: ModelSpec,
    *,
    optimization: str,
    complex_prompt: bool,
    specialization: Optional[str],
    resources: Dict[str, Any]
) -> float:
    """
    Lower score = better. Combine a few simple signals:
      - If complex → prefer 'reasoning' models.
      - If specialization detected → prefer related tags (e.g., 'code', 'math').
      - Optimization: 'speed', 'cost', or 'depth' (reasoning depth).
      - If CPU is high and provider == 'ollama' → small penalty (local load).
    """
    score = 0.0

    # Base on optimization goal
    if optimization == "speed":
        score += spec.est_latency * 2.0 + spec.size_rank * 0.5
    elif optimization == "cost":
        score += spec.est_cost_per_1k * 1.5 + spec.size_rank * 0.3
    elif optimization == "depth":
        # prefer reasoning-heavy
        score += (0 if ("reasoning" in spec.tags or "complex" in spec.tags) else 1.5)
        score += spec.size_rank * 0.2
    else:
        # default blend
        score += spec.est_latency + 0.2 * spec.est_cost_per_1k

    # Complexity
    if complex_prompt and not any(t in spec.tags for t in ["reasoning", "complex", "math", "code"]):
        score += 1.0

    # Specialization
    if specialization:
        if specialization in spec.tags or f"specialized:{specialization}" in spec.tags:
            score -= 0.5  # small bonus
        else:
            # mild penalty if it doesn't look appropriate
            score += 0.3

    # Resource-aware: if CPU is busy and we're local, prefer remote
    cpu = resources.get("cpu_percent")
    if cpu is not None and cpu > 70 and spec.provider == "ollama":
        score += 0.8

    # If RAM is very low and spec is large local model (size_rank>=3), penalize
    avail_gb = resources.get("available_gb")
    if avail_gb is not None and avail_gb < 2.0 and spec.provider == "ollama" and spec.size_rank >= 3:
        score += 0.7

    return score

# -----------------------
# Dispatch
# -----------------------

def init_model(spec: ModelSpec):
    """
    Initialize a ChatModel via LLMPop's init_llm().
    For Ollama, auto_serve/auto_install/pull are enabled to make the demo smoother.
    For OpenAI, use env OPENAI_API_KEY or pass provider_kwargs with api_key.
    """
    if spec.provider == "ollama":
        return init_llm(
            model=spec.name,
            provider="ollama",
            provider_kwargs={
                "auto_install": True,
                "auto_serve": True,
                "pull": True,
            },
            temperature=0.0,
        )

    elif spec.provider == "openai":
        # You can also pass provider_kwargs={"api_key": "..."} (guide allows both env or kwarg)
        return init_llm(
            model=spec.name,
            provider="openai",
            temperature=0.0,
        )

    else:
        raise ValueError(f"Unknown provider: {spec.provider}")

def call_model(chat_model, user_prompt: str, system_msg: str = "You are a helpful assistant."):
    """
    Call a LangChain ChatModel returned by init_llm(). Keep it simple with a short system + human turn.
    """
    from langchain_core.messages import SystemMessage, HumanMessage
    # Return raw content; for pure string you could add StrOutputParser (see guide).
    result = chat_model.invoke([SystemMessage(content=system_msg), HumanMessage(content=user_prompt)])
    return getattr(result, "content", str(result))

# -----------------------
# Main router
# -----------------------

def route_and_dispatch(
    prompt: str,
    *,
    conversation_ctx: Optional[List[Dict[str, Any]]] = None,
    optimization: str = "speed",   # 'speed' | 'cost' | 'depth'
    runtime_info: Optional[Dict[str, Any]] = None,
    csv_path: str = "/mnt/data/Ollama models by size and call name (a limited list).csv",
) -> Dict[str, Any]:
    """
    Core router: choose a model and execute the call. Returns a dict with
    selection details and the model output.
    """
    catalog = build_catalog(csv_path)

    complex_prompt = is_complex_prompt(prompt)
    specialization = detect_specialization(prompt)
    resources = snapshot_resources()

    # Incorporate any runtime quotas into a tiny penalty (dummy)
    quotas = (runtime_info or {}).get("api_quotas", {})
    openai_quota_low = quotas.get("openai_remaining", 1000) < 50
    ollama_quota_low = quotas.get("ollama_remaining", 1000) < 10

    # Score each candidate
    scored = []
    for spec in catalog:
        s = score_candidate(spec,
                            optimization=optimization,
                            complex_prompt=complex_prompt,
                            specialization=specialization,
                            resources=resources)
        # Quotas as penalties (illustrative)
        if spec.provider == "openai" and openai_quota_low:
            s += 1.0
        if spec.provider == "ollama" and ollama_quota_low:
            s += 0.5
        scored.append((s, spec))

    scored.sort(key=lambda x: x[0])
    chosen = scored[0][1]

    # Try to init and call; fall back if needed
    error = None
    response = None
    try:
        chat_model = init_model(chosen)
        response = call_model(chat_model, prompt)
    except Exception as e:
        error = f"{type(e).__name__}: {e}"
        # Fallback: attempt the next candidate
        if len(scored) > 1:
            try:
                alt = scored[1][1]
                chat_model = init_model(alt)
                response = call_model(chat_model, prompt)
                chosen = alt
            except Exception as e2:
                response = f"[Router demo fallback] Could not reach any LLM. Last error: {type(e2).__name__}: {e2}"

    return {
        "chosen_model": chosen.name,
        "provider": chosen.provider,
        "notes": chosen.notes,
        "complex_prompt": complex_prompt,
        "specialization": specialization,
        "optimization": optimization,
        "resources": resources,
        "error": error,
        "response": response,
    }

# -----------------------
# Demo entrypoint
# -----------------------

def main():
    # Optional: start a background resource logger (1 minute, every 5s) — for demo visibility.
    # This is purely illustrative; comment out if not desired.
    try:
        _ = start_resource_monitoring(logfile="resource_usage.log", duration=60, interval=5)
    except Exception:
        pass

    # Sample inputs (edit these to experiment)
    sample_prompt = (
        "Write a short Python function that returns the nth Fibonacci number, "
        "then analyze the time complexity."
    )
    conversation_ctx = [{"role": "system", "content": "You are concise."}]
    optimization = "depth"  # try 'speed', 'cost', or 'depth'

    # Dummy runtime info: pretend OpenAI quota is healthy; Ollama is unlimited.
    runtime_info = {"api_quotas": {"openai_remaining": 500, "ollama_remaining": 9999}}

    result = route_and_dispatch(
        prompt=sample_prompt,
        conversation_ctx=conversation_ctx,
        optimization=optimization,
        runtime_info=runtime_info,
    )

    print("\n=== Router Decision ===")
    print(f"Chosen: {result['chosen_model']} (provider={result['provider']})")
    print(f"Notes:  {result['notes']}")
    print(f"Complex prompt? {result['complex_prompt']}; specialization: {result['specialization']}")
    print(f"Optimization: {result['optimization']}")
    print(f"Resources snapshot: {result['resources']}")
    if result["error"]:
        print(f"Init error (first try): {result['error']}")

    print("\n=== Model Output (truncated) ===")
    print(str(result["response"])[:1200])

if __name__ == "__main__":
    main()
```
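
To get a feel for how the two prompt heuristics in the listing above behave, the sketch below re-creates them with the standard library only and probes them with a few prompts. The patterns are copied from the listing; the function bodies are condensed, and the probe prompts other than the Fibonacci one are illustrative assumptions rather than part of the original demo.

```python
import re
from typing import Optional

FENCE = "`" * 3  # Markdown code-fence marker, built at runtime to keep this snippet fence-safe

def is_complex_prompt(text: str) -> bool:
    """Condensed copy of the listing's complexity check."""
    if len(text) > 300:
        return True
    code_markers = [FENCE, "def ", "class ", "SELECT ", "FROM ", "import ", "public static void"]
    patterns = [r"\bprove\b", r"\bderive\b", r"\bO\(", r"\bintegral\b",
                r"[0-9]+\s*[\+\-\*\/\^]\s*[0-9]+",
                r"\bchain of thought\b", r"\bstep-by-step\b", r"\bthink step by step\b"]
    if any(m in text for m in code_markers):
        return True
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)

def detect_specialization(text: str) -> Optional[str]:
    """Condensed copy of the listing's specialization check (first match wins)."""
    pairs = {
        "code": [r"\bwrite (a|the) (python|js|sql|java|c\+\+|c#)"],
        "sql": [r"\bSELECT\b.*\bFROM\b", r"\bCREATE TABLE\b"],
        "finance": [r"\bDCF\b", r"\bWACC\b", r"\b10-K\b"],
        "biology": [r"\bRNA\b", r"\bCRISPR\b"],
        "law": [r"\bstatute\b", r"\btort\b"],
        "math": [r"\bprove\b", r"\bintegral\b", r"\bderivative\b"],
    }
    for tag, patterns in pairs.items():
        if any(re.search(p, text, flags=re.IGNORECASE | re.DOTALL) for p in patterns):
            return tag
    return None

SAMPLE = ("Write a short Python function that returns the nth Fibonacci number, "
          "then analyze the time complexity.")
PROBES = [
    SAMPLE,
    "Write a Python function that reverses a string.",
    "What is 12 * 7?",
    "Estimate the WACC for a mid-size retailer.",
]

for p in PROBES:
    print(f"complex={is_complex_prompt(p)!s:5} specialization={detect_specialization(p)!s:8} | {p[:50]}")
```

Under these rules the demo's own sample prompt is flagged as neither complex nor specialized: "Write a short Python" does not match the `write (a|the) (python|...)` pattern, and the prompt contains no code or math markers. Any preference for a reasoning model therefore comes from the `optimization="depth"` term rather than from prompt analysis, which is worth keeping in mind when adapting the heuristics.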

## The Actual Resulting Code
I took the code that the bot generated after a single prompt, and adjusted it to suit my needs.  
All adjustments were minor.  

The bot described the code it designed as follows:  

**Small-Scale LLM Router Demo**  
Purpose:  
  Route a user prompt to a local (Ollama) or remote (OpenAI) model using basic,
  illustrative heuristics. Keep it readable and easy to adapt.  

Prereqs:  
`pip -q install llmpop`  
`pip install langchain_core psutil pandas`  

Notes:  
  - Reads an optional CSV of Ollama models to choose a lightweight or heavier local model:  
      `Ollama models by size and call name (a limited list).csv`  
  - Uses `init_llm` from LLMPop (returns a LangChain ChatModel).  
  - Uses simple heuristics to choose a model: complexity, specialization, optimization,
    plus a quick CPU/memory snapshot and dummy cost/latency scoring.  
  - Falls back gracefully if a model or provider isn't available.
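
As a rough illustration of the "dummy cost/latency scoring" mentioned in the notes, the sketch below applies only the optimization term of the scoring function to the dummy values from this notebook's default catalog. `Candidate`, `base_score`, and `pick` are hypothetical names used for this sketch; the real router also adds complexity, specialization, resource, and quota adjustments, which are left out here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:  # hypothetical, trimmed-down stand-in for the notebook's ModelSpec
    name: str
    size_rank: int
    tags: List[str]
    est_cost_per_1k: float
    est_latency: float

# Dummy values taken from the notebook's default catalog.
CATALOG = [
    Candidate("llama3.2:1b", 2, ["general"], 0.0, 0.3),
    Candidate("codellama", 2, ["code"], 0.0, 0.35),
    Candidate("gpt-oss:20b", 3, ["general", "code", "math", "complex"], 0.0, 0.5),
    Candidate("gpt-4o-mini", 1, ["general", "fast"], 0.6, 0.8),
    Candidate("gpt-4o", 4, ["reasoning", "math", "complex"], 5.0, 0.1),
]

def base_score(c: Candidate, optimization: str) -> float:
    """Optimization term only; lower is better."""
    if optimization == "speed":
        return c.est_latency * 2.0 + c.size_rank * 0.5
    if optimization == "cost":
        return c.est_cost_per_1k * 1.5 + c.size_rank * 0.3
    if optimization == "depth":
        deep = "reasoning" in c.tags or "complex" in c.tags
        return (0.0 if deep else 1.5) + c.size_rank * 0.2
    return c.est_latency + 0.2 * c.est_cost_per_1k  # default blend

def pick(optimization: str) -> str:
    # min() keeps the first candidate on ties, like a stable sort would.
    return min(CATALOG, key=lambda c: base_score(c, optimization)).name

for mode in ("speed", "cost", "depth"):
    print(f"{mode:5} -> {pick(mode)}")
```

With these dummy numbers, `speed` and `cost` both land on the smallest local model, while `depth` prefers the local `gpt-oss:20b` because it carries the `complex` tag and has a lower size rank than `gpt-4o`. Changing the weights or the dummy values changes the outcome, which is the main knob to experiment with.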

**Installations:**

In [1]:
%pip -q install llmpop
%pip -q install langchain_core psutil pandas

**Imports:**

In [2]:
from __future__ import annotations
import os
import re
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from getpass import getpass

# --- LLMPop imports (per LLM_READABLE_GUIDE.md) ---
# Guide reference: init_llm() returns a ChatModel (ChatOllama / ChatOpenAI)
# and start_resource_monitoring() optionally logs CPU/Mem/GPU in a background thread.
from llmpop import init_llm, start_resource_monitoring

# --- Optional utilities ---
import psutil
import pandas as pd

**Settings:**

In [3]:
allow_paid_models = True

**Insert your OpenAI API key:**

In [4]:
if allow_paid_models:
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI API key: ")

Paste your OpenAI API key: ··········
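
The cell above prompts for the key on every run. A minimal sketch of a gentler variant follows, assuming you want to reuse a key that is already in the environment and to fall back to local models when none is supplied. `ensure_openai_key` is a hypothetical helper, not part of `LLMPop`; its `env` and `ask` parameters exist only so the logic can be exercised without an interactive prompt.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping

def ensure_openai_key(
    env: MutableMapping[str, str] = os.environ,
    ask: Callable[[str], str] = getpass,
) -> bool:
    """Return True if OPENAI_API_KEY is available, prompting once if it is missing."""
    if env.get("OPENAI_API_KEY"):
        return True  # already configured, do not prompt again
    key = ask("Paste your OpenAI API key: ").strip()
    if key:
        env["OPENAI_API_KEY"] = key
        return True
    return False  # no key given; the caller can restrict itself to local models

# Possible wiring (an assumption, not the notebook's code):
# allow_paid_models = ensure_openai_key()
```

Tying `allow_paid_models` to the result keeps the OpenAI entries out of the catalog when no key is present, so the router does not have to discover the missing key through a failed call.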


In [13]:
# -----------------------
# Model catalog & metadata
# -----------------------

@dataclass
class ModelSpec:
    name: str               # provider-specific model id (e.g., 'llama3:8b' or 'gpt-4o-mini')
    provider: str           # 'ollama' or 'openai'
    size_rank: int          # smaller number = lighter/faster (heuristic)
    tags: List[str]         # ['general', 'code', 'math', 'reasoning', 'specialized:finance', ...]
    est_cost_per_1k: float  # dummy USD per 1k tokens (for routing illustration)
    est_latency: float      # dummy seconds base latency (for routing illustration)
    notes: str = ""

def _read_ollama_csv(csv_path: str) -> List[ModelSpec]:
    """
    Read the user's 'Ollama models by size and call name (a limited list).csv'
    and convert it into ModelSpec entries with rough size ranks.
    The CSV is expected to have columns like: 'call_name', 'size' (text), etc.
    This function is defensive and will skip rows missing these fields.
    """
    specs: List[ModelSpec] = []
    if not os.path.exists(csv_path):
        return specs

    try:
        df = pd.read_csv(csv_path)
    except Exception:
        return specs

    # Try to infer a size rank from a 'size' or 'parameters' column
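    def infer_size_rank(size_str: str) -> int:
        if not isinstance(size_str, str):
            return 5
        s = size_str.lower()
        # simple heuristics: lower rank is "lighter"
        # Parse the parameter count (assumes values like '8b' or '1.5 B') instead of
        # substring matching, which would rank '13b' as tiny because it contains '3b'.
        m = re.search(r"(\d+(?:\.\d+)?)\s*b\b", s)
        if m:
            params = float(m.group(1))
            if params <= 4: return 1
            if params <= 9: return 2
            if params <= 20: return 3
            if params <= 40: return 4
            return 5
        if any(k in s for k in ["tiny", "mini", "small"]): return 1
        if "medium" in s: return 2
        if "large" in s: return 3
        if "xl" in s: return 4
        return 5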

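    for _, row in df.iterrows():
        # Missing cells arrive as NaN, which is truthy, so test with pd.notna
        # instead of relying on `or` to fall through to the next column.
        def first_present(*cols: str) -> str:
            return next((str(row[c]).strip() for c in cols
                         if c in row.index and pd.notna(row[c])), "")
        call_name = first_present("call_name", "name")
        size_str  = first_present("size", "parameters")
        if not call_name:
            continue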
        specs.append(
            ModelSpec(
                name=call_name,
                provider="ollama",
                size_rank=infer_size_rank(size_str),
                tags=["general"],
                est_cost_per_1k=0.0,        # local inference → treat as $0 for demo
                est_latency=max(0.2, 0.05 * infer_size_rank(size_str)),
                notes=f"From CSV; size='{size_str}'"
            )
        )
    return specs

def _default_catalog(allow_paid_models: bool = True) -> List[ModelSpec]:
    """
    Built-in demo catalog, appended to any entries read from the CSV. Adjust to your environment.
    Pass allow_paid_models=False to keep only the local Ollama models.
    """
    catalog = [
        # Local (Ollama) — fast/light general chat
        ModelSpec(name="llama3.2:1b", provider="ollama", size_rank=2,
                  tags=["general"], est_cost_per_1k=0.0, est_latency=0.3,
                  notes="Local, lightweight general model"),
        # Local (Ollama) — code-aware (replace with your local favorite)
        ModelSpec(name="codellama", provider="ollama", size_rank=2,
                  tags=["code"], est_cost_per_1k=0.0, est_latency=0.35,
                  notes="Local, code-oriented"),
        # Local (Ollama) — medium general chat
        ModelSpec(name="gpt-oss:20b", provider="ollama", size_rank=3,
                  tags=["general", "code", "math", "complex"], est_cost_per_1k=0.0, est_latency=0.5,
                  notes="Local, medium size general model")]
    if allow_paid_models:
        catalog += [
            # Remote (OpenAI) — fast/cheap
            ModelSpec(name="gpt-4o-mini", provider="openai", size_rank=1,
                      tags=["general", "fast"], est_cost_per_1k=0.6, est_latency=0.8,
                      notes="Remote, budget/latency friendly"),
            # Remote (OpenAI) — reasoning-heavy
            ModelSpec(name="gpt-4o", provider="openai", size_rank=4,
                      tags=["reasoning", "math", "complex"], est_cost_per_1k=5.0, est_latency=0.1,
                      notes="Remote, strong reasoning"),
        ]
    return catalog

def build_catalog(csv_path: str, allow_paid_models: bool = True) -> List[ModelSpec]:
    specs = _read_ollama_csv(csv_path)
    specs += _default_catalog(allow_paid_models)
    return specs

# -----------------------
# Lightweight heuristics
# -----------------------

def is_complex_prompt(text: str) -> bool:
    """
    Naive complexity signal: math markers, code fences, long length, or
    words indicating step-by-step reasoning.
    """
    if len(text) > 300:
        return True
    code_markers = ["```", "def ", "class ", "SELECT ", "FROM ", "import ", "public static void"]
    math_markers = [r"\bprove\b", r"\bderive\b", r"\bO\(", r"\bintegral\b", r"[0-9]+\s*[\+\-\*\/\^]\s*[0-9]+"]
    reasoning_terms = [r"\bchain of thought\b", r"\bstep-by-step\b", r"\bthink step by step\b"]

    if any(m in text for m in code_markers):
        return True
    if any(re.search(p, text, flags=re.IGNORECASE) for p in math_markers + reasoning_terms):
        return True
    return False

def detect_specialization(text: str) -> Optional[str]:
    """
    Return a specialization tag when obvious (very naive).
    """
    pairs = {
        "code": [r"\bwrite (a|the) (python|js|sql|java|c\+\+|c#)"],
        "sql":  [r"\bSELECT\b.*\bFROM\b", r"\bCREATE TABLE\b"],
        "finance": [r"\bDCF\b", r"\bWACC\b", r"\b10-K\b"],
        "biology": [r"\bRNA\b", r"\bCRISPR\b"],
        "law": [r"\bstatute\b", r"\btort\b"],
        "math": [r"\bprove\b", r"\bintegral\b", r"\bderivative\b"],
    }
    for tag, patterns in pairs.items():
        if any(re.search(p, text, flags=re.IGNORECASE | re.DOTALL) for p in patterns):
            return tag
    return None

def snapshot_resources() -> Dict[str, Any]:
    """
    Quick resource snapshot (CPU %, available RAM GB). These values can inform routing.
    """
    try:
        cpu = psutil.cpu_percent(interval=0.2)
        mem = psutil.virtual_memory()
        avail_gb = round(mem.available / (1024**3), 2)
        return {"cpu_percent": cpu, "available_gb": avail_gb}
    except Exception:
        return {"cpu_percent": None, "available_gb": None}

def score_candidate(
    spec: ModelSpec,
    *,
    optimization: str,
    complex_prompt: bool,
    specialization: Optional[str],
    resources: Dict[str, Any]
) -> float:
    """
    Lower score = better. Combine a few simple signals:
      - If complex → prefer 'reasoning' models.
      - If specialization detected → prefer related tags (e.g., 'code', 'math').
      - Optimization: 'speed', 'cost', or 'depth' (reasoning depth).
      - If CPU is high and provider == 'ollama' → small penalty (local load).
    """
    score = 0.0

    # Based on optimization goal
    if optimization == "speed":
        score += spec.est_latency * 2.0 + spec.size_rank * 0.5
    elif optimization == "cost":
        score += spec.est_cost_per_1k * 1.5 + spec.size_rank * 0.3
    elif optimization == "depth":
        # prefer reasoning-heavy
        score += (0 if ("reasoning" in spec.tags or "complex" in spec.tags) else 1.5)
        score += spec.size_rank * 0.2
    else:
        # default blend
        score += spec.est_latency + 0.2 * spec.est_cost_per_1k

    # Complexity
    if complex_prompt and not any(t in spec.tags for t in ["reasoning", "complex", "math", "code"]):
        score += 1.0

    # Specialization
    if specialization:
        if specialization in spec.tags or f"specialized:{specialization}" in spec.tags:
            score -= 0.5  # small bonus
        else:
            # mild penalty if it doesn't look appropriate
            score += 0.3

    # Resource-aware: if CPU is busy and we're local, prefer remote
    cpu = resources.get("cpu_percent")
    if cpu is not None and cpu > 70 and spec.provider == "ollama":
        score += 0.8

    # If RAM is very low and spec is large local model (size_rank>=3), penalize
    avail_gb = resources.get("available_gb")
    if avail_gb is not None and avail_gb < 2.0 and spec.provider == "ollama" and spec.size_rank >= 3:
        score += 0.7

    return score

# -----------------------
# Dispatch
# -----------------------

def init_model(spec: ModelSpec):
    """
    Initialize a ChatModel via LLMPop's init_llm().
    For Ollama, auto_serve/auto_install/pull are enabled to make the demo smoother.
    For OpenAI, use env OPENAI_API_KEY or pass provider_kwargs with api_key.
    """
    if spec.provider == "ollama":
        return init_llm(
            model=spec.name,
            provider="ollama",
            provider_kwargs={
                "auto_install": True,
                "auto_serve": True,
                "pull": True,
            },
            temperature=0.0,
        )

    elif spec.provider == "openai":
        return init_llm(
            model=spec.name,
            provider="openai",
            temperature=0.0,
        )

    else:
        raise ValueError(f"Unknown provider: {spec.provider}")

def call_model(chat_model, user_prompt: str, system_msg: str = "You are a helpful assistant."):
    """
    Call a LangChain ChatModel returned by init_llm(). Keep it simple with a short system + human turn.
    """
    from langchain_core.messages import SystemMessage, HumanMessage
    # Return raw content; for pure string you could add StrOutputParser (see guide).
    result = chat_model.invoke([SystemMessage(content=system_msg), HumanMessage(content=user_prompt)])
    return getattr(result, "content", str(result))

# -----------------------
# Main router
# -----------------------

def route_and_dispatch(
    prompt: str,
    *,
    conversation_ctx: Optional[List[Dict[str, Any]]] = None,
    optimization: str = "speed",   # 'speed' | 'cost' | 'depth'
    runtime_info: Optional[Dict[str, Any]] = None,
    csv_path: str = "Ollama models by size and call name (a limited list).csv",
    allow_paid_models: bool = True
) -> Dict[str, Any]:
    """
    Core router: choose a model and execute the call. Returns a dict with
    selection details and the model output.
    """
    catalog = build_catalog(csv_path, allow_paid_models)

    complex_prompt = is_complex_prompt(prompt)
    specialization = detect_specialization(prompt)
    resources = snapshot_resources()

    # Incorporate any runtime quotas into a tiny penalty (dummy)
    quotas = (runtime_info or {}).get("api_quotas", {})
    openai_quota_low = quotas.get("openai_remaining", 1000) < 50
    ollama_quota_low = quotas.get("ollama_remaining", 1000) < 10

    # Score each candidate
    scored = []
    for spec in catalog:
        s = score_candidate(spec,
                            optimization=optimization,
                            complex_prompt=complex_prompt,
                            specialization=specialization,
                            resources=resources)
        # Quotas as penalties (illustrative)
        if spec.provider == "openai" and openai_quota_low:
            s += 1.0
        if spec.provider == "ollama" and ollama_quota_low:
            s += 0.5
        scored.append((s, spec))

    scored.sort(key=lambda x: x[0])
    chosen = scored[0][1]

    # Try to init and call; fall back if needed
    error = None
    response = None
    try:
        chat_model = init_model(chosen)
        response = call_model(chat_model, prompt)
    except Exception as e:
        error = f"{type(e).__name__}: {e}"
        # Fallback: attempt the next candidate
        if len(scored) > 1:
            try:
                alt = scored[1][1]
                chat_model = init_model(alt)
                response = call_model(chat_model, prompt)
                chosen = alt
            except Exception as e2:
                response = f"[Router demo fallback] Could not reach any LLM. Last error: {type(e2).__name__}: {e2}"

    return {
        "chosen_model": chosen.name,
        "provider": chosen.provider,
        "notes": chosen.notes,
        "complex_prompt": complex_prompt,
        "specialization": specialization,
        "optimization": optimization,
        "resources": resources,
        "error": error,
        "response": response,
    }
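
`route_and_dispatch` above retries only the second-ranked candidate when the first one fails. As a minimal sketch of one possible generalization, the loop below walks the whole ranked list until a call succeeds. It is not part of the generated router: `try_in_order` and `fake_call` are hypothetical names, and the stand-in callable replaces the real `init_model` / `call_model` pair so the example runs without any LLM.

```python
from typing import Callable, List, Optional, Tuple

def try_in_order(
    ranked: List[Tuple[float, str]],
    call: Callable[[str], str],
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try candidates from lowest score upward.

    Returns (chosen_name, response, errors); chosen_name and response
    are None when every candidate failed.
    """
    errors: List[str] = []
    for _, name in sorted(ranked, key=lambda pair: pair[0]):
        try:
            return name, call(name), errors
        except Exception as exc:
            # Record the failure and move on to the next-best candidate.
            errors.append(f"{name}: {type(exc).__name__}: {exc}")
    return None, None, errors

def fake_call(name: str) -> str:
    """Hypothetical stand-in for init_model + call_model."""
    if name == "model-a":
        raise ConnectionError("unreachable")
    return f"response from {name}"

chosen, response, errors = try_in_order(
    [(0.9, "model-c"), (0.2, "model-a"), (0.5, "model-b")],
    fake_call,
)
print(chosen, response, errors)
```

Collecting every error, rather than only the first, keeps the returned record useful for debugging when several providers are down at once.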

In [14]:

# -----------------------
# Demo entrypoint
# -----------------------

def main(prompt: str,
      conversation_ctx: Optional[List[Dict[str, Any]]] = None,
      optimization: str = "speed",   # 'speed' | 'cost' | 'depth'
      runtime_info: Optional[Dict[str, Any]] = None,
      allow_paid_models: bool = True):
    # # Optional: start a background resource logger (1 minute, every 5s) — for demo visibility.
    # # This is purely illustrative; comment out if not desired.
    # try:
    #     _ = start_resource_monitoring(logfile="resource_usage.log", duration=60, interval=5)
    # except Exception:
    #     pass

    result = route_and_dispatch(
        prompt=prompt,
        conversation_ctx=conversation_ctx,
        optimization=optimization,
        runtime_info=runtime_info,
        allow_paid_models=allow_paid_models
    )

    print("\n=== Router Decision ===")
    print(f"Chosen: {result['chosen_model']} (provider={result['provider']})")
    print(f"Notes:  {result['notes']}")
    print(f"Complex prompt? {result['complex_prompt']}; specialization: {result['specialization']}")
    print(f"Optimization: {result['optimization']}")
    print(f"Resources snapshot: {result['resources']}")
    if result["error"]:
        print(f"Init error (first try): {result['error']}")

    print("\n=== Model Output (truncated) ===")
    print(str(result["response"])[:1200])

if __name__ == "__main__":
    # Sample inputs (edit these to experiment)
    prompt = (
        "Write a short Python function that returns the nth Fibonacci number, "
        "then analyze the time complexity."
    )
    conversation_ctx = [{"role": "system", "content": "You are concise."}]
    optimization = "speed"  # try 'speed', 'cost', or 'depth'

    # Dummy runtime info: pretend OpenAI quota is healthy; Ollama is unlimited.
    runtime_info = {"api_quotas": {"openai_remaining": 500, "ollama_remaining": 9999}}
    allow_paid_models = True
    main(prompt, conversation_ctx, optimization, runtime_info, allow_paid_models)

🚀 Starting Ollama server...
→ Ollama PID: 7355
⏳ Waiting for Ollama to be ready…
Ready!

🚀 Pulling model 'llama3.2:1b'…
All done setting up Ollama (ChatOllama).


=== Router Decision ===
Chosen: llama3.2:1b (provider=ollama)
Notes:  Local, lightweight general model
Complex prompt? False; specialization: None
Optimization: speed
Resources snapshot: {'cpu_percent': 0.0, 'available_gb': 10.75}

=== Model Output (truncated) ===
**Fibonacci Function and Time Complexity Analysis**

Here's a simple Python function to calculate the nth Fibonacci number using recursion:

```python
def fibonacci(n):
    """
    Returns the nth Fibonacci number.

    Args:
        n (int): The position of the Fibonacci number to return.

    Returns:
        int: The nth Fibonacci number.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```

**Time Complexity 

----

## Evaluating Various Scenarios
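
The cells in this section vary only three inputs: the prompt, `optimization`, and `allow_paid_models`. As a rough sketch, the same sweep could be generated programmatically. `scenario_grid` and `run_scenarios` are hypothetical helpers, not part of the notebook; in the notebook you would pass the `main` function defined above as `runner`, while the stub used here only records its arguments so the example runs offline.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence

def scenario_grid(
    prompts: Dict[str, str],
    optimizations: Sequence[str] = ("speed", "cost", "depth"),
    paid_options: Sequence[bool] = (True, False),
) -> List[Dict[str, Any]]:
    """Enumerate every (prompt, optimization, allow_paid_models) combination."""
    return [
        {"label": label, "prompt": text,
         "optimization": opt, "allow_paid_models": paid}
        for (label, text), opt, paid
        in product(prompts.items(), optimizations, paid_options)
    ]

def run_scenarios(scenarios: List[Dict[str, Any]], runner: Callable[..., Any]) -> None:
    """Call runner once per scenario, printing a header before each run."""
    for sc in scenarios:
        print(f"--- {sc['label']} | optimize={sc['optimization']} "
              f"| paid={sc['allow_paid_models']} ---")
        runner(sc["prompt"],
               optimization=sc["optimization"],
               allow_paid_models=sc["allow_paid_models"])

prompts = {
    "coding": ("Write a short Python function that returns the nth Fibonacci "
               "number, then analyze the time complexity."),
    "math": ("Prove that the square root of 2 is irrational. "
             "Then simplify it in one sentence for a 6 year old."),
}
scenarios = scenario_grid(prompts)

# Stub runner: records the call instead of contacting a model.
calls: List[Dict[str, Any]] = []
def record(prompt: str, **kwargs: Any) -> None:
    calls.append({"prompt": prompt, **kwargs})

run_scenarios(scenarios, record)
```

Running the full grid against real models pulls each selected Ollama model in turn, so on a Colab session it may be more practical to pass a subset of `scenarios`.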

### Optimize depth: coding task

In [15]:
# Sample inputs (edit these to experiment)
prompt = (
    "Write a short Python function that returns the nth Fibonacci number, "
    "then analyze the time complexity."
)
conversation_ctx = [{"role": "system", "content": "You are concise."}]
optimization = "depth"  # try 'speed', 'cost', or 'depth'

# Dummy runtime info: pretend OpenAI quota is healthy; Ollama is unlimited.
runtime_info = {"api_quotas": {"openai_remaining": 500, "ollama_remaining": 9999}}
allow_paid_models = True
main(prompt, conversation_ctx, optimization, runtime_info, allow_paid_models)

🚀 Starting Ollama server...
→ Ollama PID: 7395
⏳ Waiting for Ollama to be ready…
Ready!

🚀 Pulling model 'gpt-oss:20b'…
All done setting up Ollama (ChatOllama).


=== Router Decision ===
Chosen: gpt-oss:20b (provider=ollama)
Notes:  Local, medium size general model
Complex prompt? False; specialization: None
Optimization: depth
Resources snapshot: {'cpu_percent': 35.0, 'available_gb': 10.74}

=== Model Output (truncated) ===
Here’s a compact, **iterative** implementation that returns the *n*‑th Fibonacci number (with the convention `fib(0)=0`, `fib(1)=1`):

```python
def fib(n: int) -> int:
    """Return the n‑th Fibonacci number (0‑based)."""
    if n < 0:
        raise ValueError("n must be non‑negative")
    a, b = 0, 1          # a = fib(0), b = fib(1)
    for _ in range(n):
        a, b = b, a + b  # shift one step forward
    return a
```

### How it works
* `a` always holds `fib(k)` and `b` holds `fib(k+1)` for the current loop index `k`.
* Each iteration updates them to the nex

### Optimize depth: coding task  
Excluding paid remote models, sticking with local models.
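
To illustrate what excluding paid models means for the candidate list, here is a minimal sketch. It does not reproduce the notebook's `build_catalog`; `Candidate` and `filter_catalog` are hypothetical, and treating every `openai` entry as paid is an assumption made for this example. The model names are the ones that appear in the outputs in this notebook.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for a catalog entry."""
    name: str
    provider: str  # "ollama" (local) or "openai" (remote; assumed paid)

def filter_catalog(catalog: List[Candidate], allow_paid_models: bool) -> List[Candidate]:
    """Drop remote paid entries when allow_paid_models is False."""
    if allow_paid_models:
        return list(catalog)
    return [c for c in catalog if c.provider != "openai"]

catalog = [
    Candidate("llama3.2:1b", "ollama"),
    Candidate("gpt-oss:20b", "ollama"),
    Candidate("gpt-4o", "openai"),
]
local_only = filter_catalog(catalog, allow_paid_models=False)
print([c.name for c in local_only])
```

Filtering before scoring means the router never ranks a model it is not allowed to call, so the fallback path also stays local.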

In [16]:
# Sample inputs (edit these to experiment)
prompt = (
    "Write a short Python function that returns the nth Fibonacci number, "
    "then analyze the time complexity."
)
conversation_ctx = [{"role": "system", "content": "You are concise."}]
optimization = "depth"  # try 'speed', 'cost', or 'depth'

# Dummy runtime info: pretend OpenAI quota is healthy; Ollama is unlimited.
runtime_info = {"api_quotas": {"openai_remaining": 500, "ollama_remaining": 9999}}
allow_paid_models = False
main(prompt, conversation_ctx, optimization, runtime_info, allow_paid_models)

🚀 Starting Ollama server...
→ Ollama PID: 8349
⏳ Waiting for Ollama to be ready…
Ready!

🚀 Pulling model 'gpt-oss:20b'…
All done setting up Ollama (ChatOllama).


=== Router Decision ===
Chosen: gpt-oss:20b (provider=ollama)
Notes:  Local, medium size general model
Complex prompt? False; specialization: None
Optimization: depth
Resources snapshot: {'cpu_percent': 58.5, 'available_gb': 9.63}

=== Model Output (truncated) ===
Here’s a compact, **iterative** implementation that returns the *n*‑th Fibonacci number (with the convention `fib(0)=0`, `fib(1)=1`):

```python
def fib(n: int) -> int:
    """Return the n‑th Fibonacci number (0‑based)."""
    if n < 0:
        raise ValueError("n must be non‑negative")
    a, b = 0, 1          # a = fib(0), b = fib(1)
    for _ in range(n):
        a, b = b, a + b  # shift one step forward
    return a
```

### How it works
* `a` always holds the current Fibonacci number `fib(k)`.
* `b` holds the next one `fib(k+1)`.
* Each loop iteration moves one

### Optimize cost: Math puzzle

In [21]:
# Sample inputs (edit these to experiment)
prompt = (
    "Prove that the squary root of 2 is irrational."
    "Then simplify it in one sentence for a 6 year old."
)
conversation_ctx = [{"role": "system", "content": "You are concise."}]
optimization = "cost"  # try 'speed', 'cost', or 'depth'

# Dummy runtime info: pretend OpenAI quota is healthy; Ollama is unlimited.
runtime_info = {"api_quotas": {"openai_remaining": 500, "ollama_remaining": 9999}}
allow_paid_models = True
main(prompt, conversation_ctx, optimization, runtime_info, allow_paid_models)

🚀 Starting Ollama server...
→ Ollama PID: 14513
⏳ Waiting for Ollama to be ready…
Ready!

🚀 Pulling model 'gpt-oss:20b'…
All done setting up Ollama (ChatOllama).


=== Router Decision ===
Chosen: gpt-oss:20b (provider=ollama)
Notes:  Local, medium size general model
Complex prompt? True; specialization: math
Optimization: cost
Resources snapshot: {'cpu_percent': 0.0, 'available_gb': 11.1}

=== Model Output (truncated) ===
**Proof that \(\sqrt{2}\) is irrational**

Assume, for contradiction, that \(\sqrt{2}\) can be written as a fraction of two integers in lowest terms:
\[
\sqrt{2}=\frac{p}{q}\qquad (p,q\in\mathbb Z,\; \gcd(p,q)=1,\; q\neq 0).
\]

Squaring both sides gives
\[
2=\frac{p^{2}}{q^{2}}\quad\Longrightarrow\quad p^{2}=2q^{2}.
\]

Thus \(p^{2}\) is even, so \(p\) itself must be even (the square of an odd number is odd).  
Write \(p=2k\) for some integer \(k\). Substituting back:
\[
(2k)^{2}=2q^{2}\quad\Longrightarrow\quad 4k^{2}=2q^{2}\quad\Longrightarrow\quad q^{2}=2k^{2}.
\]


### Optimize speed: Math puzzle


In [22]:
# Sample inputs (edit these to experiment)
prompt = (
    "Prove that the squary root of 2 is irrational."
    "Then simplify it in one sentence for a 6 year old."
)
conversation_ctx = [{"role": "system", "content": "You are concise."}]
optimization = "speed"  # try 'speed', 'cost', or 'depth'

# Dummy runtime info: pretend OpenAI quota is healthy; Ollama is unlimited.
runtime_info = {"api_quotas": {"openai_remaining": 500, "ollama_remaining": 9999}}
allow_paid_models = True
main(prompt, conversation_ctx, optimization, runtime_info, allow_paid_models)

🚀 Setting up remote OpenAI chat model…
All done setting up OpenAI (ChatOpenAI).


=== Router Decision ===
Chosen: gpt-4o (provider=openai)
Notes:  Remote, strong reasoning
Complex prompt? True; specialization: math
Optimization: speed
Resources snapshot: {'cpu_percent': 0.0, 'available_gb': 9.64}

=== Model Output (truncated) ===
To prove that the square root of 2 is irrational, we can use a proof by contradiction. Assume that the square root of 2 is rational, meaning it can be expressed as a fraction \( \frac{a}{b} \) where \( a \) and \( b \) are integers with no common factors other than 1, and \( b \neq 0 \). Then we have:

\[
\sqrt{2} = \frac{a}{b}
\]

Squaring both sides gives:

\[
2 = \frac{a^2}{b^2}
\]

Multiplying both sides by \( b^2 \) gives:

\[
2b^2 = a^2
\]

This implies that \( a^2 \) is even, which means \( a \) must also be even (since the square of an odd number is odd). So, let \( a = 2k \) for some integer \( k \). Substituting back, we get:

\[
2b^2 = (2k)^2 = 4k^2

### Optimize cost: Math puzzle
Excluding paid remote models, sticking with local models.

In [23]:
# Sample inputs (edit these to experiment)
prompt = (
    "Prove that the squary root of 2 is irrational."
    "Then simplify it in one sentence for a 6 year old."
)
conversation_ctx = [{"role": "system", "content": "You are concise."}]
optimization = "cost"  # try 'speed', 'cost', or 'depth'

# Dummy runtime info: pretend OpenAI quota is healthy; Ollama is unlimited.
runtime_info = {"api_quotas": {"openai_remaining": 500, "ollama_remaining": 9999}}
allow_paid_models = False
main(prompt, conversation_ctx, optimization, runtime_info, allow_paid_models)

🚀 Starting Ollama server...
→ Ollama PID: 15046
⏳ Waiting for Ollama to be ready…
Ready!

🚀 Pulling model 'gpt-oss:20b'…
All done setting up Ollama (ChatOllama).


=== Router Decision ===
Chosen: gpt-oss:20b (provider=ollama)
Notes:  Local, medium size general model
Complex prompt? True; specialization: math
Optimization: cost
Resources snapshot: {'cpu_percent': 2.5, 'available_gb': 9.64}

=== Model Output (truncated) ===
**Proof that \(\sqrt{2}\) is irrational**

Assume, for contradiction, that \(\sqrt{2}\) can be written as a fraction of two integers in lowest terms:
\[
\sqrt{2}=\frac{p}{q}\qquad (p,q\in\mathbb Z,\; \gcd(p,q)=1,\; q\neq 0).
\]

Squaring both sides gives
\[
2=\frac{p^{2}}{q^{2}}\quad\Longrightarrow\quad p^{2}=2q^{2}.
\]

Thus \(p^{2}\) is even, so \(p\) itself must be even (the square of an odd number is odd).  
Write \(p=2k\) for some integer \(k\). Substituting back:
\[
(2k)^{2}=2q^{2}\quad\Longrightarrow\quad 4k^{2}=2q^{2}\quad\Longrightarrow\quad q^{2}=2k^{2}.
\]
