# Week 6 Assignment: Extending the Voice Agent with Function Calling
This week, we dive into a crucial capability of modern AI agents: function calling. Real-world agents are not limited to generating text—they interact with tools, run computations, query databases, and more. In this assignment, you will extend the multi-turn voice agent you built in Week 3 with the ability to automatically execute tools based on natural language commands.

Using Llama 3 as your core LLM, you’ll teach the model to recognize when a user wants to search arXiv papers or perform a math calculation, and respond by outputting a structured function call. Your agent will then parse that function call, execute the appropriate tool, and return the result to the user via text-to-speech. This is a major step toward building a fully autonomous research assistant that can act on intent—not just reply with facts.



## 📚 Learning Objectives

* **Function Calling with LLMs:** Learn how to use function/tool calling by prompting Llama 3 to output structured calls (e.g., JSON) for external functions.
* **Intent Parsing and Tool Mapping:** Practice parsing user queries to determine the intent (e.g., search a database or perform a calculation) and mapping that intent to specific tool functions like `search_arxiv(query)` and `calculate(expression)`.
* **Integrating Tools into the Agent:** Extend the Week 3 multi-turn voice agent pipeline (ASR → LLM → TTS) so that the LLM can trigger code execution. The agent should automatically call the right function based on the LLM’s output, and then speak the returned result.



## Project Design

Reuse the voice-chat pipeline from Week 3 and enhance the LLM step to support calling external tools. The key tasks include:

* **Tool Functions:** Implement two helper functions:


  * `search_arxiv(query: str) -> str`: Simulates or performs an arXiv search and returns a relevant passage or summary for the query.
  * `calculate(expression: str) -> str`: Evaluates a mathematical expression (for example with `sympy`; avoid passing model output to a bare `eval`) and returns the result as text.

* **Prompt Engineering:** Modify the Llama 3 system/user prompts so the model knows to generate structured function-call outputs when appropriate. For example, instruct it that if the user’s question can be answered by searching arXiv or doing math, it should output a JSON-like call, such as:

  ```json
  {"function": "calculate", "arguments": {"expression": "2+2"}}
  ```

  or

  ```json
  {"function": "search_arxiv", "arguments": {"query": "quantum entanglement"}}
  ```

  Otherwise it should respond normally in text.

* **Detecting and Calling Tools:** After the LLM generates a response, check if it is a function call. Parse the JSON output from the LLM to extract the function name and arguments. If it is a call, invoke the corresponding Python function (`search_arxiv` or `calculate`) with those arguments and capture its result. Use this result as the assistant’s reply (to be spoken by TTS). If the LLM output is normal text, just use it as the assistant’s response without calling any function.

* **Fallback Behavior:** Ensure the voice agent handles all cases. If the LLM’s output cannot be parsed as a function call (or if the named function is unknown), fall back to replying with a standard text response or an error message as appropriate.
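
The detection and fallback steps above can be sketched as a small parser that returns either a validated call or `None`. This is a minimal illustration, not the reference solution: `parse_tool_call` and `KNOWN_TOOLS` are hypothetical names, and the JSON shape is the one shown in the examples above.

```python
import json
from typing import Optional

# Hypothetical whitelist of tool names the agent is willing to run.
KNOWN_TOOLS = {"search_arxiv", "calculate"}


def parse_tool_call(llm_output: str) -> Optional[tuple[str, dict]]:
    """Return (function_name, arguments) for a valid call, else None."""
    try:
        data = json.loads(llm_output.strip())
    except json.JSONDecodeError:
        return None  # plain text reply
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object (e.g. "42")
    name = data.get("function")
    args = data.get("arguments", {})
    if not isinstance(name, str) or name not in KNOWN_TOOLS:
        return None  # missing or unknown tool name
    if not isinstance(args, dict):
        return None  # malformed arguments
    return name, args


print(parse_tool_call('{"function": "calculate", "arguments": {"expression": "2+2"}}'))
print(parse_tool_call("Hello! How can I help?"))
print(parse_tool_call('{"function": "delete_files", "arguments": {}}'))
```

A `None` result means the agent should speak the raw LLM text (or an error message) instead of calling a tool.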


## Setup

### Create and activate virtual environment


(base) C:\Users\ch939>cd C:\Users\ch939\Downloads\LLMBootCampCodes\Week6

(base) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda create -n vx_venv python=3.10 -y

(base) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda activate vx_venv

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6\bitsandbytes-windows>pip install bitsandbytes-windows

### Install required packages


(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install pip -y

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install python-dotenv

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install -c conda-forge transformers sentencepiece

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install mkl=2021.4.0


(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install accelerate

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda config --add channels conda-forge

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda config --set channel_priority strict


Math

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install -c conda-forge sympy

APIs

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install -c conda-forge huggingface_hub

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install anthropic

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install openai

FastAPI

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install -c conda-forge fastapi uvicorn python-multipart


Audio

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install -c conda-forge pydub

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install SpeechRecognition pyttsx3

Optional

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install arxiv


After switching to `model_id = "mistralai/Mistral-7B-Instruct-v0.3"` and `"alokabhishek/Meta-Llama-3-8B-Instruct-bnb-4bit"`, the following were also installed:

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6\bitsandbytes-windows>pip install protobuf

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install -c conda-forge ipywidgets

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda config --set channel_priority strict

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install --upgrade bitsandbytes

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip show bitsandbytes

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install arxiv

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install pyaudio

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>conda install -c conda-forge protobuf

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install SpeechRecognition

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install pyttsx3

(vx_venv) C:\Users\ch939\Downloads\LLMBootCampCodes\Week6>pip install fastapi uvicorn

My system (Windows 11, RTX 4070 SUPER) has about 12 GB of VRAM, which is enough for 4-bit quantized 7B to 8B models.
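
As a rough back-of-the-envelope check of what fits in 12 GB, weight memory is approximately the parameter count times the bytes per parameter. The sketch below assumes about 8 billion parameters for Llama 3 8B and ignores the KV cache, activations, and quantization overhead, so the numbers are lower bounds; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions the fp16 weights alone would exceed 12 GB, which is consistent with the quantized loading used in the rest of this notebook.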


## Starter Code

We provide snippets to help you get started:


Check: GPU & CUDA Availability in PyTorch

In [1]:
import torch

print("CUDA Available:", torch.cuda.is_available())
print("CUDA Version:", torch.version.cuda)
print("GPU Count:", torch.cuda.device_count())
print("Current Device:", torch.cuda.current_device())
print("Device Name:", torch.cuda.get_device_name(0))

CUDA Available: True
CUDA Version: 11.8
GPU Count: 1
Current Device: 0
Device Name: NVIDIA GeForce RTX 4070 SUPER


Settings and dependencies

In [2]:
# API clients
try:
    import openai
    from openai import OpenAI
    OPENAI_AVAILABLE = True
    print("✅ OpenAI available")
except ImportError:
    OPENAI_AVAILABLE = False
    print("⚠️ OpenAI not installed. Run: pip install openai")

try:
    import anthropic
    ANTHROPIC_AVAILABLE = True
    print("✅ Anthropic available")
except ImportError:
    ANTHROPIC_AVAILABLE = False
    print("⚠️ Anthropic not installed. Run: pip install anthropic")

# Load environment variables
import os
from dotenv import load_dotenv
load_dotenv(r"C:\Users\ch939\Downloads\LLMBootCampCodes\Week6\.env")

hf_token = os.getenv("HUGGINGFACE_TOKEN")
openai_key = os.getenv("OPENAI_API_KEY")
temp = hf_token

print("🔑 Hugging Face Token:", (hf_token[:5] + "...") if hf_token else "Not found")
print("🔑 OpenAI API Key:", (openai_key[:5] + "...") if openai_key else "Not found")


# 💡 System & Hardware Info
import sys
import torch
print("\n📋 System Information:")
print(f"✅ Python Version: {sys.version}")
print(f"✅ PyTorch Version: {torch.__version__}")
print(f"✅ CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU Device: {torch.cuda.get_device_name(0)}")
else:
    print("✅ GPU Device: No GPU detected")

# ✅ Final Confirmation
print("\n🎉 All dependencies and systems are ready!")

⚠️ OpenAI not installed. Run: pip install openai
⚠️ Anthropic not installed. Run: pip install anthropic
🔑 Hugging Face Token: hf_OI...
🔑 OpenAI API Key: sk-pr...

📋 System Information:
✅ Python Version: 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:42:04) [MSC v.1943 64 bit (AMD64)]
✅ PyTorch Version: 2.5.1
✅ CUDA Available: True
✅ GPU Device: NVIDIA GeForce RTX 4070 SUPER

🎉 All dependencies and systems are ready!


Check: Can You Load Llama 3?

Meta's Llama 3 repository is gated. First request access at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct. Until the request is approved, Hugging Face shows this message: "Your request to access this repository has been submitted and is awaiting a review from the repository authors."

In [10]:

from huggingface_hub import login
login(token = os.getenv("HUGGINGFACE_TOKEN"))



from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

# model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# model_id = "mistralai/Mistral-7B-Instruct-v0.3"
# model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
# model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
# model_id = "TheBloke/OpenHermes-2-Mistral-7B-GPTQ"
# model_id = "Qwen/Qwen1.5-7B-Chat-AWQ"


# Step 1: Define quantization config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,  # or load_in_8bit=True
    bnb_4bit_compute_dtype="float16",  # adjust as needed
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)


# Step 2: Load model and tokenizer with quantization config
model_id = "alokabhishek/Meta-Llama-3-8B-Instruct-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token = temp)


# Step 3: Create pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


## Project Structure

Week6/
│
├── vx_venv/                  # virtual env
├── voice_agent.py            # main agent logic
├── tools.py                  # search_arxiv, calculate
├── prompts.py                # system prompt with function instructions
├── .env                      # HF_TOKEN, OPENAI_API_KEY
└── test_logs.txt             # sample logs (deliverable)

#### tools.py

In [3]:
# tools.py
import arxiv
from sympy import sympify
from typing import Dict, Any

def search_arxiv(query: str) -> str:
    """Search arXiv and return paper titles or abstract snippets."""
    try:
        search = arxiv.Search(
            query=query,
            max_results=1,
            sort_by=arxiv.SortCriterion.Relevance
        )
        # Search.results() is deprecated in recent arxiv releases; go through a Client
        result = next(arxiv.Client().results(search), None)
        if result:
            return f"Top result: '{result.title}'. Abstract: {result.summary[:200]}..."
        else:
            return f"No papers found for '{query}'."
    except Exception as e:
        return f"Error searching arXiv: {e}"

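def calculate(expression: str) -> str:
    """Evaluate a mathematical expression with SymPy and return it as text.

    Note: sympify() uses eval internally, so it is not safe for untrusted input.
    """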
    try:
        result = sympify(expression)
        return str(result.evalf()) if result.is_number else str(result)
    except Exception as e:
        return f"Error: {e}"
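
`sympify` parses its input with `eval` internally, and SymPy's documentation warns against passing it unsanitized strings. If the calculator only needs plain arithmetic, one alternative is to walk the expression's syntax tree and allow only numeric literals and a fixed set of operators. The sketch below is an alternative under that assumption, not part of the starter code: `safe_calculate`, `_BIN_OPS`, and `_UNARY_OPS` are hypothetical names, the exponent cap of 100 is arbitrary, and the evaluator is not hardened against every resource-exhaustion input.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST) -> float:
    """Recursively evaluate a node, allowing only numbers and whitelisted ops."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left, right = _eval_node(node.left), _eval_node(node.right)
        if isinstance(node.op, ast.Pow) and abs(right) > 100:
            raise ValueError("exponent too large")
        return _BIN_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str) -> str:
    """Evaluate plain arithmetic and return the result (or an error) as text."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_eval_node(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError) as e:
        return f"Error: {e}"


print(safe_calculate("5 + 3 * 2"))
print(safe_calculate("__import__('os').getcwd()"))
```

The trade-off is that symbolic input such as `sqrt(2)` or `x + x` is rejected, so this only suits a calculator limited to numeric arithmetic.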

#### prompts.py

In [4]:
# prompts.py
SYSTEM_PROMPT = """
You are a helpful AI research assistant with access to tools. 
If the user asks to search for academic papers, respond with a JSON function call:
{"function": "search_arxiv", "arguments": {"query": "your query"}}

If the user asks to calculate a math expression, respond with:
{"function": "calculate", "arguments": {"expression": "2+2"}}

Only output JSON when calling a function. Otherwise, respond naturally in plain text.
Do not add any extra text before or after the JSON.
""".strip()

#### voice_agent.py using meta-llama/Meta-Llama-3-8B-Instruct (first option)

Option 1: `meta-llama/Meta-Llama-3-7B-Instruct` does not exist, so `meta-llama/Meta-Llama-3-8B-Instruct` has to be used. On 12 GB of VRAM it runs slowly, because some layers are offloaded to the CPU in fp32:

In [16]:
import json
import os
import re
import torch
from dotenv import load_dotenv
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
)

# =========================
# Environment & constants
# =========================
WORKDIR = r"C:\Users\ch939\Downloads\LLMBootCampCodes\Week6"

# Load API keys
load_dotenv(r"C:\Users\ch939\Downloads\LLMBootCampCodes\Week6\.env")
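# Accept either variable name; the earlier setup cell reads HUGGINGFACE_TOKEN
HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_TOKEN")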
if not HF_TOKEN:
    print("[Warn] HF_TOKEN is not set. Set it in Week6/.env to access Meta models.")

# =========================
# Model selection (Llama 3 8B with CPU fp32 offload)
# =========================
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Quantization/offload:
# - 8-bit weights to reduce VRAM pressure
# - enable fp32 CPU offload for layers that don't fit in VRAM
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Optional CUDA configuration tweaks
if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    torch.backends.cuda.matmul.allow_tf32 = True

# =========================
# Load model & tokenizer
# =========================
print(f"\n[Init] Loading {MODEL_ID} with 8-bit + CPU fp32 offload…")

# Tokenizer
# Note: Access to Meta Llama models on HF requires accepting the license
# and setting HF_TOKEN in your environment.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)

# Ensure a pad token is set (Llama 3 ships without one, so reuse EOS)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Model with automatic layer placement and CPU offload fallback
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",                # let Accelerate split across GPU/CPU
    token=HF_TOKEN,
)

# Text-generation pipeline
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,   # only return the completion
)

# =========================
# Tools & prompting
# =========================
from tools import search_arxiv, calculate
from prompts import SYSTEM_PROMPT

TOOLS = {
    "search_arxiv": search_arxiv,
    "calculate": calculate,
}

TOOL_SPEC = (
    "You have access to tools. If and only if a tool is needed, output a single JSON object on one line with keys 'function' and 'arguments'. "
    "The value of 'function' must be one of: 'search_arxiv', 'calculate'. "
    "The value of 'arguments' must be a JSON object with the call parameters. "
    "Do not add other keys. Do not add prose before or after the JSON."
)


def build_prompt(user_text: str) -> str:
    """Build a chat-formatted prompt for Llama 3 with a tool-use contract."""
    system_msg = f"{SYSTEM_PROMPT}\n\n{TOOL_SPEC}"
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_text},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )


_JSON_BLOCK_RE = re.compile(r"```(?:json)?\s*(\{[\s\S]*?\})\s*```", re.IGNORECASE)


def _extract_first_json(text: str) -> str | None:
    """Return the first JSON object found in text, supporting fenced blocks."""
    m = _JSON_BLOCK_RE.search(text)
    if m:
        return m.group(1)
    # Fallback: naive brace matching of the first {...}
    brace_stack = 0
    start = None
    for i, ch in enumerate(text):
        if ch == '{':
            if brace_stack == 0:
                start = i
            brace_stack += 1
        elif ch == '}':
            if brace_stack > 0:
                brace_stack -= 1
                if brace_stack == 0 and start is not None:
                    return text[start:i+1]
    return None


def route_llm_output(llm_output: str) -> str:
    """Parse potential tool call and route to Python functions."""
    llm_output = llm_output.strip()
    candidate = _extract_first_json(llm_output) or llm_output

    try:
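        call = json.loads(candidate)
        if not isinstance(call, dict):
            # Valid JSON that is not an object (e.g. "42") is a plain reply
            return llm_output
        func_name = call.get("function")
        args = call.get("arguments", {})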

        if func_name in TOOLS:
            try:
                result = TOOLS[func_name](**args)
            except TypeError as e:
                return f"Error: bad arguments for '{func_name}': {e}"
            except Exception as e:
                return f"Error while executing '{func_name}': {e}"
            return result if isinstance(result, str) else json.dumps(result, ensure_ascii=False)
        else:
            return llm_output
    except json.JSONDecodeError:
        return llm_output


def agent_query(
    user_text: str,
    max_new_tokens: int = 256,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> str:
    """Main agent loop with tool-routing and CPU offload-friendly generation."""
    prompt = build_prompt(user_text)

    outputs = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

    assistant_reply = outputs[0]["generated_text"].strip()
    final = route_llm_output(assistant_reply)
    return final


if __name__ == "__main__":
    print("\n[Test] Arithmetic via tool call:")
    print(agent_query("What is 5 + 3 * 2? Use the calculator tool if available."))

    print("\n[Test] ArXiv search via tool call:")
    print(agent_query("Search arXiv for recent papers on quantum entanglement, last 2 years, return 3 items."))

    print("\n[Test] Small talk (no tools):")
    print(agent_query("Hello, how are you?"))


[Warn] HF_TOKEN is not set. Set it in Week5/.env to access Meta models.

[Init] Loading meta-llama/Meta-Llama-3-8B-Instruct with 8-bit + CPU fp32 offload…


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.
Device set to use cuda:0



[Test] Arithmetic via tool call:
11.0000000000000

[Test] ArXiv search via tool call:
Error: bad arguments for 'search_arxiv': search_arxiv() got an unexpected keyword argument 'max_results'

[Test] Small talk (no tools):
I'm doing well, thanks for asking! I'm a helpful AI research assistant, here to assist you with your queries. What can I help you with today?
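
The `search_arxiv` failure in the log above happens because the model added a `max_results` argument that the Python function does not accept. One way to make routing more tolerant is to drop unexpected keyword arguments before the call. The following is a minimal sketch under that assumption; `call_with_known_args` is a hypothetical helper, and the stub `search_arxiv` stands in for the real tool.

```python
import inspect
from typing import Any, Callable


def call_with_known_args(func: Callable[..., Any], args: dict) -> Any:
    """Call func with only the keyword arguments its signature declares."""
    params = inspect.signature(func).parameters
    # If the function accepts **kwargs, pass everything through unchanged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(**args)
    accepted = {k: v for k, v in args.items() if k in params}
    return func(**accepted)


def search_arxiv(query: str) -> str:  # stub standing in for the real tool
    return f"[arXiv snippet related to '{query}']"


print(call_with_known_args(search_arxiv, {"query": "quantum entanglement", "max_results": 3}))
```

Missing required arguments still raise `TypeError`, so the existing error handling remains useful. Silently dropping arguments also hides the mismatch, so logging the discarded keys or documenting each tool's parameters in the system prompt may be preferable.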


#### voice_agent.py using mistralai/Mistral-7B-Instruct-v0.3 (second option)

In [None]:
import json
import os
import re
import torch
from dotenv import load_dotenv
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
)

# =========================
# Environment & constants
# =========================
# Working dir (for clarity in logs or future file ops)
WORKDIR = r"C:\Users\ch939\Downloads\LLMBootCampCodes\Week6"

# Load API keys from the Week6 .env file
load_dotenv(r"C:\Users\ch939\Downloads\LLMBootCampCodes\Week6\.env")
HF_TOKEN = os.getenv("HF_TOKEN")

# =========================
# Model selection (fits 12GB VRAM smoothly)
# =========================
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# Quantization: 4-bit for headroom on RTX 4070 SUPER (12 GB)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# =========================
# Load model & tokenizer
# =========================
print(f"\n[Init] Loading {MODEL_ID} in 4-bit…")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN,
)

# Text-generation pipeline
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,   # only return the completion
)

# =========================
# Tools & prompting
# =========================
from tools import search_arxiv, calculate
from prompts import SYSTEM_PROMPT

TOOLS = {
    "search_arxiv": search_arxiv,
    "calculate": calculate,
}

# A compact, explicit tool contract the model can follow
TOOL_SPEC = (
    "You have access to tools. If and only if a tool is needed, output a single JSON object on one line with keys 'function' and 'arguments'. "
    "The value of 'function' must be one of: 'search_arxiv', 'calculate'. "
    "The value of 'arguments' must be a JSON object with the call parameters. "
    "Do not add other keys. Do not add prose before or after the JSON."
)


def build_prompt(user_text: str) -> str:
    """Use the chat template for Mistral-Instruct.
    System includes function-calling contract and the user's system prompt.
    """
    system_msg = f"{SYSTEM_PROMPT}\n\n{TOOL_SPEC}"
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_text},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )


_JSON_BLOCK_RE = re.compile(r"```(?:json)?\s*(\{[\s\S]*?\})\s*```", re.IGNORECASE)


def _extract_first_json(text: str) -> str | None:
    """Return the first JSON object found in text, supporting fenced blocks.
    If none found, return None.
    """
    # Try fenced JSON first
    m = _JSON_BLOCK_RE.search(text)
    if m:
        return m.group(1)
    # Fallback: naive brace matching of the first {...}
    brace_stack = 0
    start = None
    for i, ch in enumerate(text):
        if ch == '{':
            if brace_stack == 0:
                start = i
            brace_stack += 1
        elif ch == '}':
            if brace_stack > 0:
                brace_stack -= 1
                if brace_stack == 0 and start is not None:
                    return text[start:i+1]
    return None


def route_llm_output(llm_output: str) -> str:
    """Parse potential tool call and route to Python functions.
    - Expects a single-line JSON object like {"function": "name", "arguments": {...}}
    - If not JSON, return the original text.
    """
    llm_output = llm_output.strip()

    candidate = _extract_first_json(llm_output) or llm_output

    try:
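        call = json.loads(candidate)
        if not isinstance(call, dict):
            # Valid JSON that is not an object (e.g. "42") is a plain reply
            return llm_output
        func_name = call.get("function")
        args = call.get("arguments", {})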

        if func_name in TOOLS:
            try:
                result = TOOLS[func_name](**args)
            except TypeError as e:
                return f"Error: bad arguments for '{func_name}': {e}"
            except Exception as e:
                return f"Error while executing '{func_name}': {e}"
            return result if isinstance(result, str) else json.dumps(result, ensure_ascii=False)
        else:
            # Not a known function — just return original assistant text
            return llm_output
    except json.JSONDecodeError:
        # Not JSON — treat as a normal assistant reply
        return llm_output


def agent_query(user_text: str, max_new_tokens: int = 256, temperature: float = 0.6, top_p: float = 0.9) -> str:
    """Main agent loop.
    - Builds a structured chat prompt with a strict tool-calling contract.
    - Generates a response with Mistral-7B-Instruct.
    - Routes tool calls if present; otherwise returns model text.
    """
    prompt = build_prompt(user_text)

    outputs = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

    assistant_reply = outputs[0]["generated_text"].strip()

    # Try tool routing
    final = route_llm_output(assistant_reply)
    return final


# =========================
# Quick smoke tests
# =========================
if __name__ == "__main__":
    print("\n[Test] Arithmetic via tool call:")
    print(agent_query("What is 5 + 3 * 2? Use the calculator tool if available."))

    print("\n[Test] ArXiv search via tool call:")
    print(agent_query("Search arXiv for recent papers on quantum entanglement, last 2 years, return 3 items."))

    print("\n[Test] Small talk (no tools):")
    print(agent_query("Hello, how are you?"))



[Init] Loading mistralai/Mistral-7B-Instruct-v0.3 in 4-bit…


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Device set to use cuda:0



[Test] Arithmetic via tool call:
11.0000000000000

[Test] ArXiv search via tool call:
Error: bad arguments for 'search_arxiv': search_arxiv() got an unexpected keyword argument 'year_start'

[Test] Small talk (no tools):
Top result: 'TrackMeNot-so-good-after-all'. Abstract: TrackMeNot is a Firefox plugin with laudable intentions - protecting your
privacy. By issuing a customizable stream of random search queries on its
users' behalf, TrackMeNot surmises that enough searc...


#### Re-use your Week 3 ASR/TTS pipeline:

In [None]:
import speech_recognition as sr
import pyttsx3

def listen():
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening...")
        audio = r.listen(source)
        try:
            return r.recognize_google(audio)
        except (sr.UnknownValueError, sr.RequestError):
            return "Sorry, I didn't catch that."

def speak(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

#### Then integrate

In [None]:
query = listen()
response = agent_query(query)
speak(response)

Listening...
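
Because the Week 3 agent is multi-turn, the single query/response cell above can be wrapped in a loop. The sketch below takes the three stages as plain callables so it can be exercised without a microphone or a model; `run_voice_loop`, the `stop_words` set, and the `max_turns` limit are hypothetical choices, not part of the starter code.

```python
from typing import Callable, Iterable


def run_voice_loop(
    listen: Callable[[], str],
    respond: Callable[[str], str],
    speak: Callable[[str], None],
    stop_words: Iterable[str] = ("quit", "exit", "stop"),
    max_turns: int = 20,
) -> list[tuple[str, str]]:
    """Run ASR -> agent -> TTS turns until a stop word or max_turns."""
    stops = {w.lower() for w in stop_words}
    transcript = []
    for _ in range(max_turns):
        query = listen().strip()
        if query.lower() in stops:
            break
        reply = respond(query)
        speak(reply)
        transcript.append((query, reply))
    return transcript


# Dry run with canned inputs instead of real audio.
canned = iter(["what is 2+2", "quit"])
log = run_voice_loop(lambda: next(canned), lambda q: f"echo: {q}", print)
```

With the real components this would be called as `run_voice_loop(listen, agent_query, speak)`.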


In [6]:
# Function tool stubs (starter implementations)
def search_arxiv(query: str) -> str:
    """
    Simulate an arXiv search or return a dummy passage for the given query.
    In a real system, this might query the arXiv API and extract a summary.
    """
    # Example placeholder implementation:
    return f"[arXiv snippet related to '{query}']"

def calculate(expression: str) -> str:
    """
    Evaluate a mathematical expression and return the result as a string.
    """
    try:
        from sympy import sympify
        result = sympify(expression)  # parse with SymPy (uses eval internally; not for untrusted input)
        return str(result)
    except Exception as e:
        return f"Error: {e}"

In [9]:
# Dialogue engine: function-routing logic
import json

def route_llm_output(llm_output: str) -> str:
    """
    Route LLM response to the correct tool if it's a function call, else return the text.
    Expects LLM output in JSON format like {"function": ..., "arguments": {...}}.
    """
    try:
        output = json.loads(llm_output)
        func_name = output.get("function")
        args = output.get("arguments", {})
    except (json.JSONDecodeError, TypeError, AttributeError):
        # Not a JSON object describing a function call; return the text directly
        return llm_output

    if func_name == "search_arxiv":
        query = args.get("query", "")
        return search_arxiv(query)
    elif func_name == "calculate":
        expr = args.get("expression", "")
        return calculate(expr)
    else:
        return f"Error: Unknown function '{func_name}'"


In [8]:
# Example FastAPI endpoint (sketch)
from fastapi import FastAPI
app = FastAPI()

@app.post("/api/voice-query/")
async def voice_query_endpoint(request: dict):
    # Assume request has 'text': the user's query string
    user_text = request.get("text", "")
    # Call the Llama 3 model (instructed to output function calls when needed).
    # llama3_chat_model is a placeholder for your own LLM wrapper.
    llm_response = llama3_chat_model(user_text)
    # Process LLM output and possibly call tools
    reply_text = route_llm_output(llm_response)
    # Convert reply_text to speech (TTS) and return it
    return {"response": reply_text}


The above code outlines where to plug in your LLM call and audio I/O. Integrate the function-calling logic into your existing voice agent framework.



## Deliverables

* **Codebase:** Submit your updated voice agent code with function calling integration. Document any new modules or changes clearly.
* **Test Logs:** Provide sample logs for at least three queries, showing:

  1. The user’s query text.
  2. The raw LLM response (JSON function call or normal text).
  3. Any function call made and its output.
  4. The final assistant response.
* **Demo Video:** A 1–2 minute demo of the voice agent. Show the agent handling:

  * A math query (invoking `calculate`).
  * An arXiv search query (invoking `search_arxiv`).
  * A normal query (no function call).
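
One way to capture the four items required in the test logs is to append one JSON object per query to a log file. The sketch below is only a suggestion: `log_interaction` and its field names are hypothetical, and the path defaults to the `test_logs.txt` file from the project structure.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def log_interaction(
    user_query: str,
    raw_llm_response: str,
    final_response: str,
    function_call: Optional[dict] = None,
    function_output: Optional[str] = None,
    path: str = "test_logs.txt",
) -> dict:
    """Append one interaction as a JSON line and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_query": user_query,
        "raw_llm_response": raw_llm_response,
        "function_call": function_call,  # None when no tool was used
        "function_output": function_output,
        "final_response": final_response,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Keeping the raw LLM response separate from the final response makes it easy to see, per query, whether a tool call was parsed and what it returned.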

## Exploration Tips

* **Extend Tools:** Try adding new tools (e.g. a weather lookup or translation). Define their function signatures and integrate them into your agent.
* **Tool Registry:** Create a dictionary or registry of function names to callables to simplify routing logic when you have multiple tools.
* **Other LLMs:** Experiment with other models that support function calling (e.g. GPT-4 with its function calling API). Compare how their output format and reliability differ from Llama 3.
* **Error Handling:** Make sure your agent handles invalid inputs gracefully (e.g. a malformed math expression should not crash the agent).
* **Chained Calls (Advanced):** As a challenge, allow the agent to use one tool’s output as context for another. For example, it could `search_arxiv` for a value and then `calculate` something with that value.
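
The tool-registry tip can be sketched with a small decorator that records each function under its own name, so adding a tool does not require touching the routing code. The names `TOOL_REGISTRY`, `register_tool`, and `dispatch` are hypothetical, and both tools here are placeholders rather than real implementations.

```python
from typing import Callable, Dict

TOOL_REGISTRY: Dict[str, Callable[..., str]] = {}


def register_tool(func: Callable[..., str]) -> Callable[..., str]:
    """Decorator: expose func to the agent under func.__name__."""
    TOOL_REGISTRY[func.__name__] = func
    return func


@register_tool
def calculate(expression: str) -> str:
    return f"[would evaluate '{expression}']"  # placeholder body


@register_tool
def get_weather(city: str) -> str:
    return f"[weather for {city}]"  # placeholder body


def dispatch(name: str, arguments: dict) -> str:
    """Look up a tool by name and call it, reporting errors as text."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return f"Error: Unknown function '{name}'"
    try:
        return tool(**arguments)
    except Exception as e:  # keep the voice loop alive on tool errors
        return f"Error while executing '{name}': {e}"


print(dispatch("get_weather", {"city": "Paris"}))
print(dispatch("translate", {"text": "hi"}))
```

The registry keys can also be used to generate the list of allowed function names in the system prompt, which keeps the prompt and the router in sync.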
