# HERD Demo Notebook

This notebook walks through running the HERD prototype on your machine, step by step. Because HERD is still in its infancy, the notebook's functionality is currently limited. It lays the foundation for what HERD will look like as we move toward HERD's first official release, with examples and information about how HERD is built.

## Step 0: System Level Requirements

HERD's underlying architecture relies heavily on Kubernetes to host experts. The easiest way to make Kubernetes available on your system is to install Docker Desktop, open its settings, and enable Kubernetes. Apart from this, all system-level requirements are handled by Docker Compose files, so as long as Docker is installed you will be able to host and run HERD (provided your machine has a typical amount of memory and compute available). HERD also relies on Helm; install it by following the official Helm installation instructions for your platform.

How much memory and compute HERD uses is entirely up to you, so neither memory nor GPU access is a limiting factor. This demo is geared to run on CPU with low-memory models for accessibility, but you can clone the repo and adjust the Kubernetes chart to scale HERD across whatever compute you have.
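Before moving on, it can help to confirm that the tools mentioned above are actually on your `PATH`. The following is a minimal sketch, not part of HERD itself: the helper name `find_missing_tools` is hypothetical, and the executable names (`docker`, `kubectl`, `helm`) are assumptions based on the usual names of these command-line tools.

```python
import shutil

# Tools this demo is described as depending on. The executable names are
# assumptions (for example, `kubectl` is the usual Kubernetes CLI).
REQUIRED_TOOLS = ("docker", "kubectl", "helm")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

This only checks that the executables can be found; it does not verify that Kubernetes is enabled in Docker Desktop or that the cluster is running.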



## Step 1: Starting HERD

Unlike traditional MoE models, a HERD instance runs as a server, which enables dynamic insertion and deletion of experts and modular use of the Router and Aggregator. To run HERD on your local machine, you first need to start the HERD server. Starting the server exposes the API endpoints and sets up the k8s cluster for experts. For this demo, you can start the server by navigating to the server directory and running `main.py`.
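Once `main.py` is running, the server may take a moment before it accepts requests. The sketch below polls a TCP port until a connection succeeds. It is only an illustration: the helper name `wait_for_port` is hypothetical, and the default of `localhost` on port 80 is an assumption taken from the URL used in the expert-loading cell later in this notebook.

```python
import socket
import time


def wait_for_port(host="localhost", port=80, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet (or the attempt timed out); retry shortly.
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` after starting the server returns `True` as soon as the port accepts connections. Note that an open port only shows that a process is listening, not that the Kubernetes cluster for experts has finished setting up.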

## Step 2: Loading up Experts

The HERD server has an endpoint set up to load experts into the Kubernetes cluster. When loading an expert, you can set its name, model ID, token limit, temperature, and port on the Kubernetes cluster. The Python cell below calls this API endpoint, assuming your cluster is accessible through localhost.

In [None]:
# Load an expert into the Kubernetes cluster through the HERD server API.

import requests

name = input("Enter expert name: ")
model_id = input("Enter model ID (e.g., sshleifer/tiny-gpt2): ")
max_new_tokens = int(input("Enter max new tokens (e.g., 50): "))
temperature = float(input("Enter temperature (e.g., 0.7): "))
node_port = int(input("Enter node port (e.g., 30088): "))

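# Endpoint on the HERD server; assumes the cluster is reachable through localhost.
url = "http://localhost:80/add_expert"

# max_new_tokens and temperature are parsed as numbers above (to validate the
# input) and then sent as strings; node_port is sent as an integer.
payload = {
    "name": name,
    "model_id": model_id,
    "max_new_tokens": str(max_new_tokens),
    "temperature": str(temperature),
    "node_port": node_port
}

# requests sets this header automatically when json= is used; it is kept explicit here.
headers = {
    "Content-Type": "application/json"
}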

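response = requests.post(url, json=payload, headers=headers)

print("Status Code:", response.status_code)
try:
    print("Response:", response.json())
except ValueError:
    # The body was not valid JSON (for example, an error page); show the raw text.
    print("Response:", response.text)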
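If you want to load several experts without retyping values at each prompt, the payload construction can be pulled into a small helper. This is a sketch under stated assumptions: the function name `build_expert_payload` is hypothetical, the field names and string conversions mirror the cell above, and the validation rules (non-empty name, positive token limit, non-negative temperature, port between 1 and 65535) are assumptions rather than checks the HERD server is known to enforce.

```python
def build_expert_payload(name, model_id, max_new_tokens, temperature, node_port):
    """Validate expert settings and return the JSON payload for /add_expert."""
    # These checks are illustrative assumptions, not documented HERD rules.
    if not name:
        raise ValueError("name must not be empty")
    if not model_id:
        raise ValueError("model_id must not be empty")
    if int(max_new_tokens) <= 0:
        raise ValueError("max_new_tokens must be positive")
    if float(temperature) < 0:
        raise ValueError("temperature must not be negative")
    if not 1 <= int(node_port) <= 65535:
        raise ValueError("node_port must be a valid TCP port")

    # Same shape as the payload in the cell above: numeric settings as
    # strings, node_port as an integer.
    return {
        "name": name,
        "model_id": model_id,
        "max_new_tokens": str(int(max_new_tokens)),
        "temperature": str(float(temperature)),
        "node_port": int(node_port),
    }
```

The returned dictionary can be passed straight to `requests.post("http://localhost:80/add_expert", json=payload)`, once per expert, with a different `name` and `node_port` each time.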
