# HERD Demo Notebook

This notebook walks through running the HERD prototype on your machine. Because HERD is still in its infancy, the notebook's functionality is currently limited. It lays the foundation for what HERD will look like as we move toward HERD's first official release, providing examples and information about how HERD is built.

## Step 0: System Level Requirements

HERD's underlying architecture relies heavily on Kubernetes to host experts. The easiest way to make Kubernetes available on your system is to install Docker Desktop, open its settings, and enable Kubernetes. Apart from this, all system-level requirements are handled by Docker Compose files, so as long as Docker is installed on your machine, you will be able to host and run HERD (provided your machine has standard memory and compute available). HERD also relies on Helm, which must be installed before you continue.

How much memory and compute HERD uses is entirely up to you, so neither memory nor GPU/CPU access is a limiting factor. This demo is geared to run on CPU with low-memory models for accessibility, but you can clone the repo and adjust the Kubernetes chart to scale HERD across whatever compute you have.
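As a quick sanity check before moving on, the sketch below reports which of the command-line tools from this step are missing from your `PATH`. The helper name `missing_tools` and the executable names `docker`, `kubectl`, and `helm` are assumptions for illustration; `kubectl` is not mentioned above but normally ships with Docker Desktop. Finding a tool on `PATH` does not confirm that Kubernetes is enabled or running.

```python
import shutil
from typing import Iterable, List

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ("docker", "kubectl", "helm")


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```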



## Step 1: Starting HERD

Unlike traditional MoE models, a HERD instance runs as a server, which enables dynamic insertion and deletion of experts and modular use of the Router and Aggregator. To run HERD on your local machine, you need to start the HERD server. Starting the server exposes the API endpoints and sets up the k8s cluster for experts. For this demo, you can start the server by navigating to the server directory and running `main.py`.
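Once `main.py` is running, it may take a moment before the server accepts requests. The sketch below polls a TCP port until a connection succeeds. It assumes the server listens on `localhost:80`, the address used by the request cells later in this notebook; `wait_for_port` is a hypothetical helper, not part of HERD, and a successful connection only shows that something is listening, not that the cluster is fully set up.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Assumption: the HERD server is exposed on localhost port 80.
    if wait_for_port("localhost", 80, timeout=5.0):
        print("The server is accepting connections.")
    else:
        print("No server reachable on localhost:80 yet.")
```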

## Step 2: Loading up Experts

The HERD server has an endpoint set up to load experts into the Kubernetes cluster. When loading an expert, you can set its name, model ID, token limit, temperature, and port on the Kubernetes cluster. The Python cell below calls this API endpoint, assuming your cluster is accessible through localhost.

In [None]:
# Load an expert into the Kubernetes cluster.

import requests

name = input("Enter expert name: ")
model_id = input("Enter model ID (e.g., sshleifer/tiny-gpt2): ")
max_new_tokens = int(input("Enter max new tokens (e.g., 50): "))
temperature = float(input("Enter temperature (e.g., 0.7): "))
node_port = int(input("Enter node port (e.g., 30088): "))

url = "http://localhost:80/add_expert"

payload = {
    "name": name,
    "model_id": model_id,
    "max_new_tokens": str(max_new_tokens),
    "temperature": str(temperature),
    "node_port": node_port
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

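print("Status Code:", response.status_code)
# Fall back to the raw body if the server did not return JSON.
try:
    print("Response:", response.json())
except ValueError:
    print("Raw Response:", response.text)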
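Because the values above come straight from `input()`, it can help to validate them before sending the request. The following is a minimal sketch, not part of HERD: `build_expert_payload` is a hypothetical helper, and the validation rules are assumptions. The port check uses Kubernetes' default NodePort range (30000-32767), which a cluster may be configured to change. Whether the HERD server enforces any of these limits is not stated here.

```python
from typing import Any, Dict

# Kubernetes' default NodePort range; a cluster can be configured differently.
NODE_PORT_MIN, NODE_PORT_MAX = 30000, 32767


def build_expert_payload(name: str, model_id: str, max_new_tokens: int,
                         temperature: float, node_port: int) -> Dict[str, Any]:
    """Validate the inputs and return a payload shaped like the one in the cell above."""
    if not name.strip():
        raise ValueError("name must not be empty")
    if not model_id.strip():
        raise ValueError("model_id must not be empty")
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not NODE_PORT_MIN <= node_port <= NODE_PORT_MAX:
        raise ValueError(f"node_port must be in {NODE_PORT_MIN}-{NODE_PORT_MAX}")
    return {
        "name": name.strip(),
        "model_id": model_id.strip(),
        # The cell above sends these two values as strings, so this sketch does too.
        "max_new_tokens": str(max_new_tokens),
        "temperature": str(temperature),
        "node_port": node_port,
    }


if __name__ == "__main__":
    print(build_expert_payload("expert1", "sshleifer/tiny-gpt2", 50, 0.7, 30088))
```

The returned dictionary can be passed as `json=payload` to the same `requests.post` call used above.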


## Optional Step: Querying a Specific Expert

For general health checks, or when you only need one specific domain of knowledge, the following route queries a single expert by its name.

In [None]:
import requests

expert_name = input("Enter expert name (e.g., expert1): ")
namespace = input("Enter namespace (default = 'default'): ") or "default"
prompt = input("Enter your prompt: ")
max_new_tokens = int(input("Enter max new tokens (e.g., 50): "))
temperature = float(input("Enter temperature (e.g., 0.7): "))
top_p = float(input("Enter top_p (e.g., 0.95): "))
top_k = int(input("Enter top_k (e.g., 50): "))
repetition_penalty = float(input("Enter repetition penalty (e.g., 1.0): "))

url = f"http://localhost:80/experts/{expert_name}/ask?namespace={namespace}"

payload = {
    "prompt": prompt,
    "max_new_tokens": max_new_tokens,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
    "repetition_penalty": repetition_penalty
}

headers = {"Content-Type": "application/json"}

response = requests.post(url, json=payload, headers=headers)

print("Status Code:", response.status_code)
try:
    print("Response:", response.json())
except Exception:
    print("Raw Response:", response.text)
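The cell above places user input directly into the URL, so a name or namespace containing spaces or reserved characters would produce a malformed request. The sketch below shows one way to build the same URL with percent-encoding from the standard library. `build_ask_url` is a hypothetical helper; the route shape is copied from the cell above.

```python
from urllib.parse import quote, urlencode


def build_ask_url(base_url: str, expert_name: str, namespace: str = "default") -> str:
    """Build the /experts/{name}/ask URL with the name and namespace safely encoded."""
    # safe="" also encodes "/" so the expert name cannot add extra path segments.
    path = f"/experts/{quote(expert_name, safe='')}/ask"
    query = urlencode({"namespace": namespace})
    return f"{base_url.rstrip('/')}{path}?{query}"


if __name__ == "__main__":
    print(build_ask_url("http://localhost:80", "expert1"))
```

Alternatively, passing `params={"namespace": namespace}` to `requests.post`, as the pipeline cell in Step 3 does, lets `requests` encode the query string for you.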


## Step 3: Routing a Query

The block below executes the full HERD pipeline, taking in a query and outputting a final aggregated answer.

In [None]:
import re
import json
import requests
from typing import Dict, List, Any

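BASE_URL = "http://localhost:80"   # HERD server address
NAMESPACE = "default"              # Kubernetes namespace that hosts the experts

CLASSIFY_TOP_K = 5                 # number of topics to request from /create_prompts
SCORE_THRESHOLD = 0.20             # minimum topic score for an expert to be queried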


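# Explicit topic-to-expert names; topics not listed here fall back to a slugified name.
TOPIC_TO_EXPERT: Dict[str, str] = {
    "Physics": "physics-expert",
    "Math": "math-expert",
}


# Generation parameters sent with every expert request.
GEN = {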
    "max_new_tokens": 200,
    "temperature": 0.4,
    "top_p": 0.95,
    "top_k": 50,
    "repetition_penalty": 1.05,
}

def slugify_topic_to_expert(topic: str) -> str:
    """Fallback expert name: slugify the topic and append '-expert'."""
    slug = re.sub(r'[^a-zA-Z0-9]+', '-', topic).strip('-').lower()
    return f"{slug}-expert"

def pick_expert_name(topic: str) -> str:
    return TOPIC_TO_EXPERT.get(topic, slugify_topic_to_expert(topic))

def call_create_prompts(text: str, top_k: int) -> Dict[str, Any]:
    """
    POST /create_prompts
    Body: {"text": <query>, "top_k": <int>}
    Returns: {"original": str, "model": str, "topics": [...], "prompts": {...}}
    """
    url = f"{BASE_URL}/create_prompts"
    payload = {"text": text, "top_k": top_k}
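    # Use a timeout so a stalled server cannot hang the notebook (same value as call_expert_ask).
    r = requests.post(url, json=payload, headers={"Content-Type": "application/json"}, timeout=180)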
    r.raise_for_status()
    return r.json()

def call_expert_ask(name: str, prompt: str, gen: Dict[str, Any], namespace: str) -> Dict[str, Any]:
    """
    POST /experts/{name}/ask?namespace={namespace}
    Body: {"prompt": "...", **GEN}
    Returns the routed response payload from your ask_expert_by_name route.
    """
    url = f"{BASE_URL}/experts/{name}/ask"
    params = {"namespace": namespace}
    body = {"prompt": prompt, **gen}
    r = requests.post(url, params=params, json=body, timeout=180)
    r.raise_for_status()
    try:
        return r.json()
    except Exception:
        return {"status_code": r.status_code, "text": r.text}

def extract_text_from_model_response(resp: Dict[str, Any]) -> str:
    """
    Try to normalize common response shapes into plain text for aggregation.
    Supports:
      { "response": { "completion": "..."} }
      { "response": { "text": "..."} }
      or just dumps JSON as fallback
    """
    if isinstance(resp, dict) and "response" in resp and isinstance(resp["response"], dict):
        inner = resp["response"]
        return inner.get("completion") or inner.get("text") or json.dumps(inner, ensure_ascii=False)
    return json.dumps(resp, ensure_ascii=False)

def run_pipeline(user_query: str) -> Dict[str, Any]:
    """
    1) Create prompts (classification + specialization)
    2) Select experts (score ≥ threshold)
    3) Query each selected expert via /experts/{name}/ask
    4) Aggregate outputs
    """
    cp = call_create_prompts(user_query, CLASSIFY_TOP_K)
    topics = cp.get("topics", [])  # each: {"topic": "...", "score": ".../number"}
    prompts_by_topic = {k: v.get("prompt") for k, v in cp.get("prompts", {}).items()}

    selected = [t for t in topics if float(t.get("score", 0.0)) >= SCORE_THRESHOLD]

    per_expert: List[Dict[str, Any]] = []
    for t in selected:
        topic = t["topic"]
        expert_name = pick_expert_name(topic)
        # Prefer the specialized prompt; fall back to the original query if it is missing or empty
        specialized_prompt = prompts_by_topic.get(topic)
        prompt_to_send = specialized_prompt or user_query

        resp = call_expert_ask(expert_name, prompt_to_send, GEN, NAMESPACE)
        text = extract_text_from_model_response(resp)

        per_expert.append({
            "topic": topic,
            "score": float(t.get("score", 0.0)),
            "expert": expert_name,
            "used_specialized_prompt": bool(specialized_prompt),
            "prompt_sent": prompt_to_send,
            "raw_response": resp,
            "text": text,
        })

    per_expert_sorted = sorted(per_expert, key=lambda x: x["score"], reverse=True)
    combined_answer = "\n\n".join(
        [f"[{e['expert']} • {e['topic']} • score={e['score']:.2f}]\n{e['text']}" for e in per_expert_sorted]
    ) if per_expert_sorted else "(no experts selected)"

    return {
        "query": user_query,
        "topics": topics,
        "selected": [{"topic": e["topic"], "score": e["score"], "expert": e["expert"]} for e in per_expert_sorted],
        "answers": per_expert_sorted,
        "combined_answer": combined_answer,
    }

if __name__ == "__main__":
    q = input("Enter your prompt: ").strip()
    out = run_pipeline(q)

    print("\n== Selected experts ==")
    if not out["selected"]:
        print("None (no topic met the threshold).")
    else:
        for s in out["selected"]:
            print(f" - {s['expert']} (topic={s['topic']}, score={s['score']:.2f})")

    print("\n== Combined Answer ==")
    print(out["combined_answer"])
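To see how the selection step behaves without a running server, the sketch below applies the same threshold filter and slug fallback to a hand-written topic list. The sample topics and scores are invented for illustration and are not real classifier output; `select_experts` is a hypothetical helper that mirrors the logic of `run_pipeline` and `pick_expert_name` above.

```python
import re
from typing import Any, Dict, List


def select_experts(topics: List[Dict[str, Any]], threshold: float,
                   topic_to_expert: Dict[str, str]) -> List[Dict[str, Any]]:
    """Keep topics scoring at or above `threshold` and resolve each to an expert name."""
    selected = []
    for t in topics:
        score = float(t.get("score", 0.0))  # scores may arrive as strings or numbers
        if score < threshold:
            continue
        topic = t["topic"]
        # Fallback name: slugify the topic and append "-expert".
        slug = re.sub(r"[^a-zA-Z0-9]+", "-", topic).strip("-").lower()
        expert = topic_to_expert.get(topic, f"{slug}-expert")
        selected.append({"topic": topic, "score": score, "expert": expert})
    # Highest-scoring topics first, as in the combined answer.
    return sorted(selected, key=lambda e: e["score"], reverse=True)


if __name__ == "__main__":
    # Invented sample data, shaped like the "topics" list returned by /create_prompts.
    sample_topics = [
        {"topic": "Math", "score": 0.15},
        {"topic": "Organic Chemistry", "score": "0.35"},
        {"topic": "Physics", "score": 0.72},
    ]
    mapping = {"Physics": "physics-expert", "Math": "math-expert"}
    for entry in select_experts(sample_topics, 0.20, mapping):
        print(entry)
```

With these sample values, `Math` falls below the 0.20 threshold and is dropped, while `Organic Chemistry` has no explicit mapping and resolves to the slugified name `organic-chemistry-expert`. An expert with that name would still need to be loaded in Step 2 for the request to succeed.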
