# HuggingChat **Omni** + MCP: Multimodal Agentic Workflows (Tutorial)

This follow‑up notebook shows how to build **three agentic, multimodal workflows** that *route* each task to a suitable open model, and how to expose those capabilities through an **MCP (Model Context Protocol) server** so MCP clients such as Claude Desktop, Cursor, or VS Code can call them.

**Workflows included**
1. **Code Generation + Reflection Review** (auto‑routing → draft → self‑critique → revision; running quick tests is left as an optional extension)
2. **Media Generation & Refinement** (image **and** audio flows with iterative feedback)
3. **Web‑Scraping → Data Extraction → Structured Output** (CSV/JSON)

> ℹ️ *Omni routing concept:* HuggingChat **Omni** picks a good open model for each request. Below we replicate that *idea* locally using a small router model (default: `katanemo/Arch-Router-1.5B`) plus simple policies, then call Hugging Face Inference API models per task. Switch models as you like.

**What you’ll get**
- A minimal **router** + **agents** for text / image / audio / web data
- A runnable **MCP server** exposing tools: `infer_intent`, `code_generate`, `code_review`, `image_generate`, `audio_tts`, `asr_transcribe`, `web_search`, `web_fetch`, `extract_table`
- End‑to‑end examples you can run as plain Python or via MCP clients

---
**Setup notes**
- You’ll need a **Hugging Face token** with Inference API access: set `HF_TOKEN` as an env var.
- Some cells call `pip install` for optional extras (beautifulsoup4, readability-lxml, duckduckgo-search, mcp, etc.).
- The *router* cell can use an LLM router (`katanemo/Arch-Router-1.5B`) or fallback heuristics.


In [None]:
%%bash
# --- Install (optional) ---
# The %%bash cell magic must be the first line of the cell.
# If you already have these packages, you can skip this cell.
pip -q install --upgrade huggingface_hub httpx
pip -q install beautifulsoup4 readability-lxml duckduckgo-search pandas matplotlib python-dotenv
pip -q install "mcp>=1.2.0"  # Model Context Protocol server SDK


In [None]:
# --- Imports & basic config ---
import os, io, json, time, base64, textwrap, pathlib, tempfile
from dataclasses import dataclass
from typing import Optional, Dict, Any, List, Tuple
import httpx
from huggingface_hub import InferenceClient
from dotenv import load_dotenv
load_dotenv()

HF_TOKEN = os.getenv("HF_TOKEN", "")
if not HF_TOKEN:
    print("⚠️ Set HF_TOKEN in your environment to enable real API calls.")

# A simple helper to create an HF Inference client
def hf_client(base_url: Optional[str] = None):
    if base_url:
        return InferenceClient(token=HF_TOKEN, base_url=base_url)
    return InferenceClient(token=HF_TOKEN)

# Save artifacts (images/audio/CSVs)
ARTIFACTS_DIR = pathlib.Path("artifacts")
ARTIFACTS_DIR.mkdir(exist_ok=True)
ARTIFACTS_DIR.resolve()


## 0) Router: "Omni"-style selection

We’ll use a small router model by default (`katanemo/Arch-Router-1.5B`) with a simple system prompt that lists our available *routes*. If the router call fails or isn’t available, we fall back to **heuristics** based on keywords.


In [None]:
ROUTER_MODEL = "katanemo/Arch-Router-1.5B"  # used by HuggingChat chat-ui as an optional router

ROUTES = [
    {"name": "code_generation", "description": "Generate code, functions, scripts"},
    {"name": "code_review", "description": "Reflect on code, find bugs, suggest fixes"},
    {"name": "image_generation", "description": "Create or edit images from prompts"},
    {"name": "audio_tts", "description": "Text-to-speech voice generation"},
    {"name": "asr_transcribe", "description": "Speech-to-text transcription"},
    {"name": "web_search", "description": "Find URLs for a topic"},
    {"name": "web_fetch", "description": "Fetch + clean a page"},
    {"name": "extract_table", "description": "Extract tabular data and save CSV"},
    {"name": "general", "description": "General assistant / reasoning"},
]

ROUTER_SYSTEM = f"""
You are a routing model. Given the user's latest request and recent context, reply with JSON only: {{\"route\": \"<exact_route_name>\"}}.
Choose from these routes:
{json.dumps(ROUTES)}
Rules:
1) If request is about code creation, choose code_generation. If code critique or fix, choose code_review.
2) If prompt asks to make an image, choose image_generation. If synthesize voice from text, audio_tts. If transcribe audio, asr_transcribe.
3) If the user asks to find or browse information online, choose web_search; if they give a URL to retrieve/clean, choose web_fetch; if they ask to turn page content into a table/CSV, choose extract_table.
4) Otherwise choose general.
"""

def route_with_model(query: str, history: Optional[List[Dict[str, str]]] = None) -> str:
    """Route using the Arch-Router model. Falls back to heuristics on failure."""
    history = history or []
    try:
        if not HF_TOKEN:
            raise RuntimeError("HF token missing; using heuristic fallback.")
        client = hf_client()
        messages = [{"role": "system", "content": ROUTER_SYSTEM}] + history + [{"role": "user", "content": query}]
        out = client.chat_completion(model=ROUTER_MODEL, messages=messages, max_tokens=64, temperature=0.1)
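        text = out.choices[0].message.content
        route = json.loads(text).get("route", "general")
        # Accept only routes defined in ROUTES; anything else becomes "general".
        return route if route in {r["name"] for r in ROUTES} else "general"
    except Exception:
        # Any failure (missing token, network error, non-JSON reply) lands here:
        # fall back to keyword heuristics.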
        q = query.lower()
        if any(k in q for k in ["generate image", "make a picture", "draw", "sdxl", "stable diffusion", "image of"]):
            return "image_generation"
        if any(k in q for k in ["tts", "text to speech", "voice", "say this", "audio generate"]):
            return "audio_tts"
        if any(k in q for k in ["transcribe", "asr", "speech to text", "what is being said"]):
            return "asr_transcribe"
        if any(k in q for k in ["write code", "implement", "function", "script", "class", "boilerplate"]):
            return "code_generation"
        if any(k in q for k in ["fix", "bug", "optimize", "refactor", "review my code", "lint"]):
            return "code_review"
        if any(k in q for k in ["search", "find sources", "web", "google", "duckduckgo"]):
            return "web_search"
        if any(k in q for k in ["fetch", "download page", "scrape", "clean this url", "extract from url", "http://", "https://"]):
            return "web_fetch"
        if any(k in q for k in ["extract table", "convert to csv", "make dataset", "tabular"]):
            return "extract_table"
        return "general"

print("Router ready.")
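
Router models do not always return bare JSON: a reply may be wrapped in a code fence or padded with prose, which sends `json.loads` straight to the heuristic fallback. The following is a minimal sketch of a more tolerant parser. `parse_route` and `VALID_ROUTES` are hypothetical names (not part of the notebook or of any library), and the regex assumes the reply contains a single flat JSON object.

```python
import json
import re

# Hypothetical: mirrors the route names listed in ROUTES above.
VALID_ROUTES = {
    "code_generation", "code_review", "image_generation", "audio_tts",
    "asr_transcribe", "web_search", "web_fetch", "extract_table", "general",
}

def parse_route(reply: str, default: str = "general") -> str:
    """Extract {"route": ...} from a model reply, tolerating fences and extra prose."""
    match = re.search(r"\{[^{}]*\}", reply)  # first flat {...} object in the reply
    if not match:
        return default
    try:
        route = json.loads(match.group(0)).get("route", default)
    except json.JSONDecodeError:
        return default
    # Reject non-string values and names we never defined.
    if isinstance(route, str) and route in VALID_ROUTES:
        return route
    return default

print(parse_route('```json\n{"route": "web_fetch"}\n```'))  # fenced reply
print(parse_route("I think this is about images."))          # no JSON at all
print(parse_route('{"route": "make_coffee"}'))               # unknown route name
```

Validating against a fixed set keeps a hallucinated route name from reaching the dispatcher; the keyword heuristics remain useful as a second line of defence.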


## 1) Model map (switch per task)

Below is a **default routing table**. Swap any model IDs for your preferences/endpoints.

- Code gen / review: `Qwen2.5-Coder`, `DeepSeek-Coder`, `HuggingFaceH4/zephyr-7b-beta`, etc.
- Image gen: SDXL / Flux family (requires GPU endpoint).
- Audio: a TTS checkpoint and an ASR (Whisper) model.


In [None]:
MODEL_MAP = {
    # text/code
    "code_generation": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "code_review": "HuggingFaceH4/zephyr-7b-beta",
    "general": "mistralai/Mistral-7B-Instruct-v0.3",
    # image
    "image_generation": "stabilityai/stable-diffusion-xl-base-1.0",  # or black-forest-labs/FLUX.1-schnell
    # audio
    "audio_tts": "espnet/kan-bayashi-ljspeech-vits",
    "asr_transcribe": "openai/whisper-large-v3",  # choose a public HF ASR endpoint you can access
}

def pick_model(route: str) -> str:
    return MODEL_MAP.get(route, MODEL_MAP["general"]) 

print(MODEL_MAP)
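
Section 6 mentions overriding models with `MODEL_*` environment variables, and the MCP server script in section 4 reads them with `os.getenv`. A minimal sketch of applying the same overrides to the notebook's map is shown here. `ENV_OVERRIDES` and `resolve_model_map` are hypothetical helpers; the variable names are copied from the server script.

```python
import os
from typing import Dict, Mapping, Optional

# Environment variable names as used by the MCP server script in section 4.
ENV_OVERRIDES = {
    "code_generation": "MODEL_CODEGEN",
    "code_review": "MODEL_CODEREVIEW",
    "general": "MODEL_GENERAL",
    "image_generation": "MODEL_IMAGE",
    "audio_tts": "MODEL_TTS",
    "asr_transcribe": "MODEL_ASR",
}

def resolve_model_map(
    defaults: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of `defaults` with any non-empty MODEL_* variables applied."""
    env = os.environ if env is None else env
    resolved = dict(defaults)
    for route, var in ENV_OVERRIDES.items():
        value = env.get(var, "").strip()
        if value:  # ignore unset or blank variables
            resolved[route] = value
    return resolved

defaults = {
    "code_generation": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "general": "mistralai/Mistral-7B-Instruct-v0.3",
}
# "my-org/my-coder" is a placeholder model ID, not a real checkpoint.
print(resolve_model_map(defaults, env={"MODEL_CODEGEN": "my-org/my-coder"}))
```

Passing `env` explicitly keeps the helper easy to test; in the notebook you would call `resolve_model_map(MODEL_MAP)` and let it read `os.environ`.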


## 2) HF Inference helpers

Small wrappers for **chat**, **image**, and **audio** calls using the Hugging Face Inference API.


In [None]:
def chat_llm(messages: List[Dict[str, str]], model: Optional[str] = None, **kw) -> str:
    model = model or MODEL_MAP["general"]
    if not HF_TOKEN:
        return "[DRY-RUN] (No HF token) Would call chat_completion on: " + model
    client = hf_client()
    out = client.chat_completion(
        model=model,
        messages=messages,
        temperature=kw.get("temperature", 0.2),
        max_tokens=kw.get("max_tokens", 1024),
    )
    return out.choices[0].message.content

def text_to_image(prompt: str, model: Optional[str] = None, out_name: str = "image.png") -> str:
    model = model or MODEL_MAP["image_generation"]
    if not HF_TOKEN:
        return "[DRY-RUN] (No HF token) Would generate image with: " + model
    # Go through InferenceClient (like tts/asr below) rather than a hard-coded endpoint URL.
    # text_to_image returns a PIL image, so Pillow must be installed
    # (it is pulled in as a matplotlib dependency by the install cell).
    image = hf_client().text_to_image(prompt, model=model)
    out_path = ARTIFACTS_DIR / out_name
    image.save(out_path)
    return str(out_path)

def tts(text: str, model: Optional[str] = None, out_name: str = "speech.wav") -> str:
    model = model or MODEL_MAP["audio_tts"]
    if not HF_TOKEN:
        return "[DRY-RUN] (No HF token) Would synthesize TTS with: " + model
    client = hf_client()
    audio_bytes = client.text_to_speech(text=text, model=model)
    out_path = ARTIFACTS_DIR / out_name
    out_path.write_bytes(audio_bytes)
    return str(out_path)

def asr(audio_path: str, model: Optional[str] = None) -> str:
    model = model or MODEL_MAP["asr_transcribe"]
    if not HF_TOKEN:
        return "[DRY-RUN] (No HF token) Would transcribe with: " + model
    client = hf_client()
    with open(audio_path, "rb") as f:
        result = client.automatic_speech_recognition(audio=f, model=model)
    return result.get("text") if isinstance(result, dict) else str(result)


## 3) Agents

We’ll define three simple agents and wire them to the router:

- **CodeAgent**: draft → *reflection* critique → revision (tests are not executed here; add a sandboxed runner if you want them)
- **MediaAgent**: image/audio generation with iterative refinement
- **WebAgent**: search → fetch → extract → structure (CSV/JSON)


In [None]:
def code_generate(prompt: str, language: str = "python") -> str:
    model = pick_model("code_generation")
    system = f"You write high-quality, idiomatic {language} code with docstrings and simple tests when asked."
    return chat_llm(
        [{"role": "system", "content": system}, {"role": "user", "content": prompt}],
        model=model
    )

def code_reflect(code_text: str, language: str = "python") -> str:
    model = pick_model("code_review")
    system = f"You are a strict {language} code reviewer. Find bugs, edge cases, complexity issues. Propose concrete, minimal diffs."
    return chat_llm(
        [{"role": "system", "content": system},
         {"role": "user", "content": f"Review this code and propose improved version with reasons.\n\n```{language}\n{code_text}\n```"}],
        model=model
    )

def code_revision(prompt: str, draft: str, review: str, language: str = "python") -> str:
    model = pick_model("code_generation")
    system = f"Revise the code based on the review. Keep it concise, correct, and tested. Language: {language}."
    return chat_llm(
        [{"role": "system", "content": system},
         {"role": "user", "content": f"User request: {prompt}\n\nDraft code:\n```{language}\n{draft}\n```\n\nReview feedback:\n{review}\n\nPlease output only the final improved code in a single fenced block."}],
        model=model
    )

def run_codegen_reflection(user_prompt: str, language: str = "python") -> Dict[str, str]:
    route = route_with_model(user_prompt)
    if route not in ("code_generation", "code_review", "general"):
        print(f"Router suggested {route}; overriding to code_generation for this workflow.")
        route = "code_generation"  # make the returned route match the override
    draft = code_generate(user_prompt, language)
    review = code_reflect(draft, language)
    final = code_revision(user_prompt, draft, review, language)
    return {"route": route, "draft": draft, "review": review, "final": final}

def media_generate_and_refine(kind: str, prompt: str, feedback: Optional[str] = None) -> Dict[str, str]:
    assert kind in ("image", "audio"), "kind must be 'image' or 'audio'"
    if kind == "image":
        route = route_with_model("generate image: " + prompt)
        path1 = text_to_image(prompt, out_name="image_v1.png")
        if isinstance(path1, str) and path1.startswith("[DRY-RUN]"):
            refined_path = path1
        else:
            fb = feedback or chat_llm(
                [{"role": "system", "content": "You improve prompts for image generation."},
                 {"role": "user", "content": f"Original prompt: {prompt}. Suggest a refined prompt to improve composition/lighting/details."}]
            )
            refined_prompt = f"{prompt}. Refinements: {fb}"
            refined_path = text_to_image(refined_prompt, out_name="image_v2.png")
        return {"route": route, "v1": str(path1), "v2": str(refined_path)}
    else:
        route = route_with_model("tts: " + prompt)
        p1 = tts(prompt, out_name="speech_v1.wav")
        if isinstance(p1, str) and p1.startswith("[DRY-RUN]"):
            p2 = p1
        else:
            fb = feedback or chat_llm(
                [{"role": "system", "content": "You improve TTS prompts: pacing, emphasis, style."},
                 {"role": "user", "content": f"Original text: {prompt}. Suggest SSML-like cues to improve prosody."}]
            )
            refined_text = f"{prompt}\n\nCues: {fb}"
            p2 = tts(refined_text, out_name="speech_v2.wav")
        return {"route": route, "v1": str(p1), "v2": str(p2)}

def web_search(query: str, max_results: int = 5) -> List[Tuple[str, str]]:
    try:
        from duckduckgo_search import DDGS
    except Exception:
        print("Install duckduckgo-search to enable web_search.")
        return []
    results = []
    with DDGS() as ddgs:
        for r in ddgs.text(query, max_results=max_results):
            results.append((r.get("title", ""), r.get("href", "")))
    return results

def web_fetch(url: str) -> Dict[str, str]:
    import bs4, readability
    r = httpx.get(url, timeout=30, follow_redirects=True)  # httpx does not follow redirects by default
    r.raise_for_status()
    doc = readability.Document(r.text)
    title = doc.short_title()
    html = doc.summary()
    soup = bs4.BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ")
    return {"title": title or "", "text": text}

def extract_table(text: str, out_csv: str = "extracted.csv") -> str:
    import pandas as pd
    system = "You are a data wrangler. Given raw webpage text, extract a small table (up to ~20 rows) in CSV format with headers."
    csv_text = chat_llm(
        [{"role": "system", "content": system},
         {"role": "user", "content": text[:12000]}],
        max_tokens=1200
    )
    try:
        from io import StringIO
        df = pd.read_csv(StringIO(csv_text))
    except Exception:
        df = pd.DataFrame({"extracted": [csv_text]})
    out_path = ARTIFACTS_DIR / out_csv
    df.to_csv(out_path, index=False)
    return str(out_path)
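
`code_revision` asks the model for a single fenced block, and `extract_table` hands the model reply straight to `pandas.read_csv`. Chat models often wrap either kind of output in a Markdown fence, sometimes with a sentence before or after it. A minimal sketch for peeling that wrapper off follows; `extract_fenced_block` is a hypothetical helper, and it assumes you want the first fenced block only.

```python
import re

# Opening fence with an optional language tag, then a lazy body up to the closing fence.
_FENCE_RE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

def extract_fenced_block(reply: str) -> str:
    """Return the body of the first fenced block, or the stripped reply if there is none."""
    match = _FENCE_RE.search(reply)
    return (match.group(1) if match else reply).strip()

reply = "Here is the final version:\n```python\ndef f():\n    return 1\n```\nLet me know!"
print(extract_fenced_block(reply))

# Replies without a fence pass through unchanged (apart from surrounding whitespace).
print(extract_fenced_block("name,value\nalpha,1\n"))
```

In the agents above, this could be applied to `res["final"]` before saving code to disk, or to `csv_text` before parsing it.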


### Quick demos (run locally)

These calls exercise the agents directly (without MCP).

In [None]:
# 1) Code gen + reflection
res = run_codegen_reflection("Write a Python function that returns the nth Fibonacci number iteratively.")
print("ROUTE:", res["route"])
print("\n--- DRAFT ---\n", res["draft"][:400], "...\n")
print("\n--- REVIEW ---\n", res["review"][:400], "...\n")
print("\n--- FINAL ---\n", res["final"][:400], "...\n")

# 2) Image gen + refinement (will save to artifacts/)
img = media_generate_and_refine("image", "a cozy study room at golden hour, volumetric light, photorealistic")
print(img)

# 3) Audio TTS + refinement (will save to artifacts/)
aud = media_generate_and_refine("audio", "Welcome to the Omni + MCP tutorial! This line will be synthesized.")
print(aud)

# 4) Web scraping flow
hits = web_search("latest advances in battery technology site:arxiv.org", max_results=3)
print(hits)
if hits:
    page = web_fetch(hits[0][1])
    csv_path = extract_table(page["text"])  # heuristic LLM extraction
    print("Saved:", csv_path)
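
Workflow 3 is described as producing CSV/JSON, while the demo above saves CSV only. A minimal sketch for turning CSV text into JSON records with the standard library is shown here. `csv_text_to_records` is a hypothetical helper, and the sample rows are placeholders rather than extracted data.

```python
import csv
import io
import json
from typing import Dict, List

def csv_text_to_records(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text (first row = headers) into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    return [dict(row) for row in reader]

sample = "name,value\nalpha,1\nbeta,2\n"  # placeholder data for illustration
records = csv_text_to_records(sample)
print(json.dumps(records, indent=2))
```

All values come back as strings; convert numeric columns explicitly (or load the saved CSV with pandas and call `DataFrame.to_json(orient="records")`) if you need typed output.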


## 4) MCP Server (tools for agentic workflows)

We’ll now write a minimal MCP server script to `artifacts/omni_mcp_server.py`. It exposes tools:
- `infer_intent` (router)
- `code_generate`, `code_review`
- `image_generate`
- `audio_tts`, `asr_transcribe`
- `web_search`, `web_fetch`, `extract_table`

**Run it:**
```bash
python artifacts/omni_mcp_server.py
```
Then register this server in your MCP client (Claude Desktop / Cursor / VS Code extension).
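
Several MCP clients register stdio servers through a JSON file with an `mcpServers` object (Claude Desktop's `claude_desktop_config.json` and Cursor's `mcp.json` both use this shape). The sketch below builds such an entry in Python so the absolute paths are filled in for you. `build_client_config` and the `omni-mcp` entry name are assumptions; check your client's documentation for where the file lives and whether it expects extra fields.

```python
import json
import sys
from pathlib import Path
from typing import Any, Dict

def build_client_config(server_script: str, hf_token: str = "<your-hf-token>") -> Dict[str, Any]:
    """Build an `mcpServers` entry that launches the server script over stdio."""
    return {
        "mcpServers": {
            "omni-mcp": {
                # Use the interpreter that has `mcp` and `huggingface_hub` installed.
                "command": sys.executable,
                "args": [str(Path(server_script).resolve())],
                "env": {"HF_TOKEN": hf_token},
            }
        }
    }

config = build_client_config("artifacts/omni_mcp_server.py")
print(json.dumps(config, indent=2))
```

Paste the printed JSON into the client's config file (merging with any existing `mcpServers` entries) and restart the client so it picks up the new server.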


In [None]:
# The script is stored in a raw, triple-single-quoted string so that the
# docstrings inside it do not terminate the literal.
mcp_server_code = r'''# omni_mcp_server.py
import base64, io, json, os, tempfile
from typing import Dict, List

import httpx
from huggingface_hub import InferenceClient
from mcp.server.fastmcp import FastMCP  # high-level SDK class providing @tool() and run()

HF_TOKEN = os.getenv("HF_TOKEN", "")
ROUTER_MODEL = os.getenv("ROUTER_MODEL", "katanemo/Arch-Router-1.5B")

MODEL_MAP = {
    "code_generation": os.getenv("MODEL_CODEGEN", "Qwen/Qwen2.5-Coder-7B-Instruct"),
    "code_review":    os.getenv("MODEL_CODEREVIEW", "HuggingFaceH4/zephyr-7b-beta"),
    "general":        os.getenv("MODEL_GENERAL", "mistralai/Mistral-7B-Instruct-v0.3"),
    "image_generation": os.getenv("MODEL_IMAGE", "stabilityai/stable-diffusion-xl-base-1.0"),
    "audio_tts":        os.getenv("MODEL_TTS", "espnet/kan-bayashi-ljspeech-vits"),
    "asr_transcribe":   os.getenv("MODEL_ASR", "openai/whisper-large-v3"),
}

ROUTE_NAMES = [
    "code_generation", "code_review", "image_generation", "audio_tts",
    "asr_transcribe", "web_search", "web_fetch", "extract_table", "general",
]

def hf_client():
    return InferenceClient(token=HF_TOKEN)

def chat_llm(messages: List[Dict[str, str]], model: str) -> str:
    if not HF_TOKEN:
        return "[DRY-RUN] Missing HF_TOKEN."
    out = hf_client().chat_completion(model=model, messages=messages, temperature=0.2, max_tokens=1024)
    return out.choices[0].message.content

def route(query: str) -> str:
    if not HF_TOKEN:
        # No token: keyword heuristics only.
        q = query.lower()
        if any(k in q for k in ["image", "draw", "picture"]): return "image_generation"
        if any(k in q for k in ["tts", "voice"]): return "audio_tts"
        if any(k in q for k in ["transcribe", "asr"]): return "asr_transcribe"
        if any(k in q for k in ["function", "script", "class", "code"]): return "code_generation"
        if any(k in q for k in ["review", "refactor", "bugs"]): return "code_review"
        if any(k in q for k in ["search", "google", "find"]): return "web_search"
        if any(k in q for k in ["http://", "https://", "fetch", "scrape"]): return "web_fetch"
        return "general"
    system = 'You are a router. Reply with JSON only: {"route": "<name>"}. Valid names: ' + ", ".join(ROUTE_NAMES)
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]
    try:
        name = json.loads(chat_llm(messages, ROUTER_MODEL)).get("route", "general")
        return name if name in ROUTE_NAMES else "general"
    except Exception:
        return "general"

server = FastMCP("omni-mcp")

@server.tool()
def infer_intent(query: str) -> str:
    """Return a route name for the user's query."""
    return route(query)

@server.tool()
def code_generate(prompt: str, language: str = "python") -> str:
    """Generate code for the prompt in the given language."""
    model = MODEL_MAP["code_generation"]
    sys_prompt = f"Write high-quality {language} code with docstrings."
    return chat_llm([{"role": "system", "content": sys_prompt}, {"role": "user", "content": prompt}], model)

@server.tool()
def code_review(code: str, language: str = "python") -> str:
    """Review code, list bugs, and propose an improved version."""
    model = MODEL_MAP["code_review"]
    sys_prompt = f"Review {language} code; find bugs & propose improved version with reasons."
    return chat_llm([{"role": "system", "content": sys_prompt}, {"role": "user", "content": code}], model)

@server.tool()
def image_generate(prompt: str) -> str:
    """Generate an image; returns JSON with a base64-encoded PNG."""
    model = MODEL_MAP["image_generation"]
    if not HF_TOKEN:
        return "[DRY-RUN] Missing HF_TOKEN."
    image = hf_client().text_to_image(prompt, model=model)  # PIL image (needs Pillow)
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return json.dumps({"image_base64": b64})

@server.tool()
def audio_tts(text: str) -> str:
    """Synthesize speech; returns 'AUDIO_BYTES:' followed by base64 audio."""
    model = MODEL_MAP["audio_tts"]
    if not HF_TOKEN:
        return "[DRY-RUN] Missing HF_TOKEN."
    audio_bytes = hf_client().text_to_speech(text=text, model=model)
    b64 = base64.b64encode(audio_bytes).decode("utf-8")
    return f"AUDIO_BYTES:{b64}"

@server.tool()
def asr_transcribe(path_or_url: str) -> str:
    """Transcribe a local audio file or an audio URL."""
    model = MODEL_MAP["asr_transcribe"]
    if not HF_TOKEN:
        return "[DRY-RUN] Missing HF_TOKEN."
    if path_or_url.startswith(("http://", "https://")):
        # Download with httpx (already a dependency) instead of requests.
        resp = httpx.get(path_or_url, timeout=60, follow_redirects=True)
        resp.raise_for_status()
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            tmp.write(resp.content)
            audio_path = tmp.name
    else:
        audio_path = path_or_url
    with open(audio_path, "rb") as f:
        result = hf_client().automatic_speech_recognition(audio=f, model=model)
    return result.get("text") if isinstance(result, dict) else str(result)

@server.tool()
def web_search(query: str, max_results: int = 5) -> str:
    """Search the web; returns a JSON list of {title, url}."""
    try:
        from duckduckgo_search import DDGS
        hits = []
        with DDGS() as ddgs:
            for r in ddgs.text(query, max_results=max_results):
                hits.append({"title": r.get("title", ""), "url": r.get("href", "")})
        return json.dumps(hits)
    except Exception as e:
        return json.dumps({"error": str(e)})

@server.tool()
def web_fetch(url: str) -> str:
    """Fetch a page and return JSON with its title and cleaned text."""
    import bs4, readability
    r = httpx.get(url, timeout=30, follow_redirects=True)  # httpx does not follow redirects by default
    r.raise_for_status()
    doc = readability.Document(r.text)
    title = doc.short_title()
    html = doc.summary()
    soup = bs4.BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ")
    return json.dumps({"title": title or "", "text": text})

@server.tool()
def extract_table(text: str) -> str:
    """Extract a small table from raw text and return it as CSV."""
    messages = [{"role": "system", "content": "Return CSV only."}, {"role": "user", "content": text[:12000]}]
    csv_text = chat_llm(messages, MODEL_MAP["general"])
    return csv_text

if __name__ == "__main__":
    server.run(transport="stdio")
'''

from pathlib import Path
ARTIFACTS_DIR = Path("artifacts"); ARTIFACTS_DIR.mkdir(exist_ok=True)
(ARTIFACTS_DIR / "omni_mcp_server.py").write_text(mcp_server_code, encoding="utf-8")
print("Wrote:", (ARTIFACTS_DIR / "omni_mcp_server.py").resolve())


## 5) Putting it together — three **agentic** workflows

Below are thin, user‑facing wrappers that (a) **infer intent** with the router and (b) hand off to the right agent. You can call them directly or surface them through MCP tools.


In [None]:
import re  # used to pull a URL out of a free-form request

def omni_handle(user_request: str, extra: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    extra = extra or {}
    route = route_with_model(user_request)
    model = pick_model(route)
    out: Dict[str, Any] = {"route": route, "model": model}
    if route == "code_generation":
        out.update(run_codegen_reflection(user_request))
    elif route == "code_review":
        out["review"] = code_reflect(extra.get("code", user_request))
    elif route == "image_generation":
        out.update(media_generate_and_refine("image", user_request))
    elif route == "audio_tts":
        out.update(media_generate_and_refine("audio", user_request))
    elif route == "asr_transcribe":
        audio_path = extra.get("audio_path", "")
        out["transcript"] = asr(audio_path) if audio_path else "No audio_path provided."
    elif route == "web_search":
        out["results"] = web_search(user_request)
    elif route in ("web_fetch", "extract_table"):
        # Prefer an explicit URL; otherwise take the first URL in the request text.
        # Passing the whole sentence to web_fetch would fail as an invalid URL.
        match = re.search(r"https?://\S+", user_request)
        url = extra.get("url") or (match.group(0) if match else "")
        if url:
            page = web_fetch(url)
            out.update(page)
            out["csv_path"] = extract_table(page.get("text", ""))
        else:
            out["error"] = "No URL found in the request or in extra['url']."
    else:
        out["answer"] = chat_llm([{"role": "user", "content": user_request}])
    return out

print(omni_handle("Generate an image of a hummingbird in flight at sunrise"))


## 6) Configuration tips & safety

- Set `HF_TOKEN` (and optionally override models with `MODEL_*` env vars) for real calls.
- Respect website **robots** and terms when scraping. Keep request volume low.
- **Images/Audio** via public models may contain artifacts; use iterative refinement.
- **Code exec** is *not* performed here by default; if you add it, sandbox it with time and memory limits.
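
For the last point, a minimal sketch of running generated code in a separate process with a wall-clock timeout and an address-space cap is shown below. `run_snippet` is a hypothetical helper, the memory cap relies on the POSIX-only `resource` module (it is skipped where that module is missing), and the defaults are arbitrary. Process limits like these are not a security boundary; untrusted code still belongs in a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, Optional

try:
    import resource  # POSIX only; not available on Windows
except ImportError:
    resource = None

def run_snippet(code: str, timeout_s: float = 5.0, memory_mb: Optional[int] = 512) -> Dict[str, object]:
    """Run `code` in a child Python process with a timeout and optional memory cap."""

    def limit_memory() -> None:
        # Runs in the child just before exec: cap its virtual address space.
        limit = memory_mb * 1024 * 1024
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard != resource.RLIM_INFINITY:
            limit = min(limit, hard)  # never try to exceed the existing hard limit
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

    use_limit = resource is not None and memory_mb is not None
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site/env
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
                preexec_fn=limit_memory if use_limit else None,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}

print(run_snippet("print(1 + 1)"))
print(run_snippet("while True:\n    pass", timeout_s=1.0))
```

A natural place to call this is after `code_revision`, feeding the result back to the reviewer when `ok` is false.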


*Notebook generated on: 2025-10-22 03:41:32*