# Evaluation for Assistant API

## Introduction

The dataset is the well-known `hotpotqa`, which is commonly used to evaluate question answering and context understanding.

This notebook has two goals:

1. Investigate the performance of an open-source solution that uses `mixtral-8x7b` with `LLMCompiler` as the function calling strategy.
2. Compare that solution with the official OpenAI Assistant API (with gpt-3.5-turbo).


In [98]:
!pip install datasets numpy langchain langchain-community openai



In [99]:
%reload_ext autoreload
%autoreload 2

## Prepare dataset

Only the hard-level questions in the [validation split](https://huggingface.co/datasets/scholarly-shadows-syndicate/hotpotqa_with_qa_gpt35/viewer/default/validation) are used in this notebook.

In [100]:
from datasets import load_dataset

dataset = load_dataset("scholarly-shadows-syndicate/hotpotqa_with_qa_gpt35", split="validation", streaming=True).filter(lambda x: x["level"] == "hard")
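
Because the split is loaded with `streaming=True`, `dataset` is an iterable rather than an indexable table. A small helper can preview the first few records without downloading the whole split. This is only a sketch: `preview` and the stand-in records are hypothetical, and the field names are the ones this notebook reads later (`id`, `level`, `question`, `answer`).

```python
from itertools import islice


def preview(stream, n=3, fields=("id", "level", "question", "answer")):
    """Return the first n records of an iterable dataset, restricted to a few fields."""
    return [{key: record.get(key) for key in fields} for record in islice(stream, n)]


# Stand-in records shaped like the fields used in this notebook (made-up values).
fake_stream = iter([
    {"id": "a1", "level": "hard", "question": "Q1?", "answer": "A1", "extra": 1},
    {"id": "b2", "level": "hard", "question": "Q2?", "answer": "A2", "extra": 2},
])

for row in preview(fake_stream, n=1):
    print(row)
```

With the real data, `preview(dataset)` would be called instead of passing `fake_stream`.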


## Benchmark runner

* `BenchmarkRunner.run`: loads the validation dataset, runs the QA task, and saves the results to `output_file_path`.
* `BenchmarkRunner.get_metrics`: loads the runner results from `output_file_path` and computes the metrics.
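
`BenchmarkRunner.run` polls each run until it leaves the `queued` and `in_progress` states, and the loop in this notebook has no upper bound on how long it waits. A bounded variant could look like the sketch below. The helper `poll_until_actionable` and its parameters are hypothetical and not part of the notebook or of the OpenAI client; `fetch_status` stands for any callable that returns the current run status.

```python
import time

# The two statuses the notebook's loop sleeps on.
WAITING = {"queued", "in_progress"}


def poll_until_actionable(fetch_status, interval=1.0, timeout=120.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll until the status leaves the waiting states or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status not in WAITING:
            return status  # e.g. "requires_action", "completed", or a failure state
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout}s")
        sleep(interval)


# Usage with a canned status sequence instead of a real API call.
statuses = iter(["queued", "in_progress", "requires_action"])
print(poll_until_actionable(lambda: next(statuses), interval=0.0))
```

Passing `sleep` and `clock` as parameters keeps the helper easy to exercise without real waiting.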

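Only one search tool is used in this test. It wraps the Wikipedia docstore from the local `tool.wikipedia` module in a langchain `BaseTool`. The Tavily-based tool borrowed from langchain is left commented out below, so `TAVILY_API_KEY` only needs to be set in the environment variables if you re-enable it.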

In [101]:
# from langchain.utilities.tavily_search import TavilySearchAPIWrapper
# from langchain.tools.tavily_search import TavilySearchResults
# from langchain_core.utils.function_calling import convert_to_openai_function
from langchain_community.utilities import GoogleSerperAPIWrapper

# from langchain.agents import load_tools

# tools = load_tools(["google-serper"])

# tavily_tool = TavilySearchResults(api_wrapper=TavilySearchAPIWrapper(), max_results=5)
# search_tool_schema = convert_to_openai_function(tools[0])
# search_tool_schema["name"] = "search"
# print(search_tool_schema)

# result = tools[0].run("country with most populations")
# print(result)

# search_result = "\n".join([item["content"] for item in result])

# search_result

from langchain.tools import BaseTool, StructuredTool, tool
from tool.wikipedia import ReActWikipedia, DocstoreExplorer
from langchain.pydantic_v1 import BaseModel, Field
from typing import Optional, Type, Any
from langchain.callbacks.manager import (
    AsyncCallbackManagerForToolRun,
    CallbackManagerForToolRun,
)

class SearchInput(BaseModel):
    query: str = Field(description="entity to search for on Wikipedia, e.g., Mount Everest, cheetah, San Francisco, etc.")


class WikiSearch(BaseTool):
    name = "search"
    description = "useful for when you need to answer questions about current events"
    args_schema: Type[BaseModel] = SearchInput
    
    web_searcher = ReActWikipedia()
    docstore = DocstoreExplorer(web_searcher)

    def _run(self, query: str, run_manager: Optional[CallbackManagerForToolRun] = None) -> str:
        return self.docstore.search(query)

    async def _arun(self, query: str, run_manager: Optional[AsyncCallbackManagerForToolRun] = None) -> str:
        return await self.docstore.asearch(query)


wiki_search = WikiSearch()
print(wiki_search.name)
print(wiki_search.description)
print(wiki_search.args)
print(wiki_search.invoke("Corliss Archer in Kiss and Tell"))
print(wiki_search.run({"query": "Mount Everest"}))

search
useful for when you need to answer questions about current events
{'query': {'title': 'Query', 'description': 'entity to search for on Wikipedia, e.g., Mount Everest, cheetah, San Francisco, etc.', 'type': 'string'}}
Kiss and Tell is a 1945 American comedy film starring then 17-year-old Shirley Temple as Corliss Archer. In the film, two teenage girls cause their respective parents much concern when they start to become interested in boys. The parents' bickering about which girl is the worse influence causes more problems than it solves.[2]. The movie was based on the Broadway play Kiss and Tell, which was based on the Corliss Archer short stories. The stories, play and movie were all written by F.
Mount Everest[3] is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The China–Nepal border runs across its summit point.[4] Its elevation (snow height) of 8,848.86 m (29,031 ft 8+1⁄2 in) was most recently established in 2020 by the 

In [102]:
from openai import OpenAI
from langchain_core.utils.function_calling import convert_to_openai_function
import json
import os
import re
import string
import logging
import numpy as np
import time
from datasets import load_dataset


logging.basicConfig(level=logging.INFO)


def normalize_answer(s):
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False


def compare_answer(answer: str, label: str):
    """Compare the answer (from Agent) and label (GT).
    Label can be either a string or a number.
    If label is a number, we allow 10% margin.
    Otherwise, we do the best-effort string matching.
    """
    if answer is None:
        return False

    # see if label is a number, e.g. "1.0" or "1"
    if is_number(label):
        label = float(label)
        # try cast answer to float and return false if it fails
        try:
            answer = float(answer)
        except (TypeError, ValueError):
            return False
        # allow 10% margin; ordering the bounds keeps this correct for negative labels
        low, high = sorted((label * 0.9, label * 1.1))
        return low < answer < high

    else:
        label = normalize_answer(label)
        answer = normalize_answer(answer)
        return answer == label


class BenchmarkRunner:
    
    thread_history = []
    
    
    def __init__(self, 
                 openai_client: OpenAI, 
                 model_name: str, 
                 instructions: str,
                 fail_fast: bool = False,
                 output_file_path: str = "output/hotqa_result.json"):
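        """
        Benchmark an agent with an OpenAI client.
        :param openai_client: client pointed at the Assistants-compatible endpoint under test
        :param model_name: model name passed to the assistant on creation
        :param instructions: useful to provide examples for joiner of LLMCompiler
        :param fail_fast: raise on the first tool or run error instead of logging it and moving on
        :param output_file_path: JSON file where per-question results are saved and resumed from
        """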
        super().__init__()
        self.fail_fast = fail_fast
        self.logger = logging.getLogger("BenchmarkRunner")
        self.logger.setLevel(logging.DEBUG)
        self.client = openai_client
        self.output_file_path = output_file_path
        self.search_tool = wiki_search
        self.model_name = model_name
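        self.assistant = None
        # Per-instance list; it shadows the class-level one so runners do not share thread ids.
        self.thread_history = []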
        self.instructions = instructions
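        # Resume from a previous run if a readable result file exists.
        self.result = {}
        if os.path.exists(output_file_path):
            try:
                with open(output_file_path) as result_file:
                    self.result = json.load(result_file)
            except (OSError, json.JSONDecodeError):
                self.logger.warning(f"could not read {output_file_path}, starting from scratch")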

    def cleanup(self):
        for thread_id in self.thread_history:
            self.logger.info(f"delete thread {thread_id}")
            self.client.beta.threads.delete(thread_id=thread_id)
        if self.assistant:
            self.logger.info(f"delete assistant {self.assistant.id}")
            self.client.beta.assistants.delete(assistant_id=self.assistant.id)

    def __enter__(self):
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        try:
            self.cleanup()
        except Exception as e:
            self.logger.error(e)

    def run(self):
        self.logger.info(f"run started")
        
        self.assistant = self.client.beta.assistants.create(
                name="benchmark-runner",
                model=self.model_name,
                instructions=self.instructions,
                tools=[{"type": "function", "function": convert_to_openai_function(self.search_tool)}]
            )
        self.logger.info(f"assistant id: {self.assistant.id}")
        
        
        for item in load_dataset("scholarly-shadows-syndicate/hotpotqa_with_qa_gpt35", split="validation", streaming=True).filter(lambda x: x["level"] == "hard"):
            item_id = item['id']
            self.logger.info(f"item id={item_id}, contained in result? {item_id in self.result}")
            if item_id in self.result and self.result[item_id]["ok"]:
                continue
            run = self.client.beta.threads.create_and_run(
                assistant_id=self.assistant.id,
                thread={
                    "messages": [
                        {"role": "user", "content": item["question"]}
                    ]
                },
                stream=False)
            self.logger.info(f"run, id={run.id}, thread_id={run.thread_id}")

            self.thread_history.append(run.thread_id)
            result_item = {
                "ok": False,
                "answer": "",
                "truth": item["answer"], 
                "id": item["id"],
                "rt": 0
            }
            # Start the latency clock once per question, before polling begins.
            ts_1 = time.time()
            while True:
                run = self.client.beta.threads.runs.retrieve(thread_id=run.thread_id, run_id=run.id)
                if run.status == "queued" or run.status == "in_progress":
                    time.sleep(1)
                elif run.status == "requires_action":
                    tool_messages = []
                    for call in run.required_action.submit_tool_outputs.tool_calls:
                        self.logger.info(f"got tool call: {call.json()}")
                        if call.type == "function" and call.function.name == self.search_tool.name:
                            try:
                                tool_result  = self.search_tool.run(json.loads(call.function.arguments))
                                tool_messages.append({"tool_call_id": call.id, "output": tool_result})
                            except Exception as e:
                                if self.fail_fast:
                                    raise e
                                self.logger.error(f"Tool error {call.function.name} with args: {call.function.arguments}", exc_info=e)
                                break
                        else:
                            if self.fail_fast:
                                raise RuntimeError(f"Unknown tool call occurred, function name {call.function.name}")
                            self.logger.error(f"Unknown tool call occurred, function name {call.function.name}")
                            break
                    self.logger.info(f"len(tool_messages)={len(tool_messages)}, len(tool_calls)={len(run.required_action.submit_tool_outputs.tool_calls)}")
                    if len(tool_messages) == len(run.required_action.submit_tool_outputs.tool_calls):
                        run = self.client.beta.threads.runs.submit_tool_outputs(thread_id=run.thread_id, run_id=run.id, tool_outputs=tool_messages)
                        self.logger.info(f"run object status after submit: {run.status}")
                    else:
                        if self.fail_fast:
                            raise RuntimeError("Not every call is responded.")
                        self.logger.error("Not every call is responded.")
                        break
                elif run.status == "completed": 
                    messages = self.client.beta.threads.messages.list(thread_id=run.thread_id, run_id=run.id, order="asc")
                    result_item["ok"] = True
                    result_item["answer"] = messages.data[-1].content[0].text.value
                    self.logger.info("begin printing trajectory =============================")
                    for message in messages.data:
                        self.logger.info(f"{message.role}: {message.content[0].text.value}")
                    self.logger.info("finish printing trajectory =============================")
                    break
                else:
                    if self.fail_fast:
                        raise RuntimeError(f"run is in other terminal status: {run.to_json()}")
                    self.logger.error(f"run is in other terminal status: {run.to_json()}")
                    break    
            
            result_item["rt"] = time.time() - ts_1
            self.result[item_id] = result_item
            self.logger.info(f"id={result_item['id']}, ok={result_item['ok']}")
            
            # write down the result
            with open(self.output_file_path, "w") as output_json:
                json.dump(self.result, output_json)
        
            
    def get_metrics(self):
        with open(self.output_file_path, "r") as result_file:
            result = json.load(result_file)
            result_items = result.values()
            acc = np.average([compare_answer(item["answer"], item["truth"]) for item in result_items])
            rt_avg = np.average([item["rt"] for item in result_items])
            rt_std = np.std([item["rt"] for item in result_items])
            success_rate = np.average([1 if item["ok"] else 0 for item in result_items])
            
            logging.info(f"Success rate: {success_rate}")
            logging.info(f"Accuracy: {acc}")
            logging.info(f"Latency: {rt_avg} +/- {rt_std}")
            
            return success_rate, acc, rt_avg, rt_std



DEFAULT_JOINER_INSTRUCTIONS_WITH_EXAMPLES = r'''Here are some examples with a tool named "search":

Question: Which magazine was started first Arthur's Magazine or First for Women?
search({"query": "Arthur's Magazine"})
Observation: Arthur's Magazine (1844-1846) was an American literary periodical published in Philadelphia in the 19th century.
search({"query": "First for Women"})
Observation: First for Women is a woman's magazine published by Bauer Media Group in the USA.[1] The magazine was started in 1989.
Thought: Arthur's Magazine was started in 1844. First for Women was started in 1989. 1844 (Arthur's Magazine) < 1989 (First for Women), so Arthur's Magazine was started first.
Action: Finish(Arthur's Magazine)
<END_OF_RESPONSE>

Question: Were Pavel Urysohn and Leonid Levin known for the same type of work?
search({"query": "Pavel Urysohn"})
Observation: Pavel Samuilovich Urysohn (February 3, 1898 - August 17, 1924) was a Soviet mathematician who is best known for his contributions in dimension theory.
search({"query": "Leonid Levin"})
Observation: Leonid Anatolievich Levin is a Soviet-American mathematician and computer scientist.
Thought: Pavel Urysohn is a mathematician. Leonid Levin is a mathematician and computer scientist. So Pavel Urysohn and Leonid Levin have the same type of work.
Action: Finish(yes)
<END_OF_RESPONSE>

Question: What profession does Nicholas Ray and Elia Kazan have in common?
search({"query": "Nicholas Ray"})
Observation: Nicholas Ray (born Raymond Nicholas Kienzle Jr., August 7, 1911 - June 16, 1979) was an American film director best known for the 1955 film Rebel Without a Cause.
search({"query": "Elia Kazan"})
Observation: Elia Kazan was an American film and theatre director.
Thought: Professions of Nicholas Ray are director, screenwriter, and actor. Professions of Elia Kazan are director. So profession Nicholas Ray and Elia Kazan have in common is director.
Action: Finish(director)
<END_OF_RESPONSE>'''
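
The result file written by `run` maps each question id to a record with the keys `ok`, `answer`, `truth`, `id` and `rt`, and `get_metrics` averages over all of those records, including failed ones. The sketch below mirrors that aggregation on an in-memory dict so the schema is easy to see. It makes some assumptions: `summarize` and `naive_match` are hypothetical names, the standard library `statistics` module stands in for numpy, and `naive_match` is a simplified stand-in for `compare_answer`.

```python
import statistics


def summarize(results, is_correct):
    """Aggregate per-question records the same way get_metrics does."""
    items = list(results.values())
    latencies = [item["rt"] for item in items]
    return {
        "success_rate": statistics.fmean([1.0 if item["ok"] else 0.0 for item in items]),
        "accuracy": statistics.fmean(
            [1.0 if is_correct(item["answer"], item["truth"]) else 0.0 for item in items]
        ),
        "rt_avg": statistics.fmean(latencies),
        # Population standard deviation, matching the default of np.std.
        "rt_std": statistics.pstdev(latencies),
    }


def naive_match(answer, truth):
    # Simplified stand-in for compare_answer: case-insensitive exact match.
    return answer.strip().lower() == truth.strip().lower()


# Made-up records in the shape of the result file.
sample = {
    "q1": {"ok": True, "answer": "Paris", "truth": "paris", "id": "q1", "rt": 2.0},
    "q2": {"ok": False, "answer": "", "truth": "42", "id": "q2", "rt": 4.0},
}
print(summarize(sample, naive_match))
```

Note that a failed run keeps an empty `answer`, so it counts against accuracy and its latency still enters the averages.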

# Benchmarks


## With `mini-assistant`

Start the `mini-assistant` server.

* `llm_compiler` is used for agent execution.
* `mixtral-8x7b` is hosted by vLLM. Make sure the `HUGGING_FACE_HUB_TOKEN` environment variable is set for vLLM.

vLLM shell command using docker:

```shell
docker run --runtime nvidia --gpus all \
    -v /workspace/dropbox/huggingface_models:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
    --quantization marlin \
    --dtype=float16
```

mini-assistant shell command:

```shell
mkdir -p /tmp/mini-assistant-db
mkdir -p /tmp/mini-assistant-files
mini-assistant --db_file_path /tmp/assistant_eval.db \
  --file_store_path /tmp/mini-assistant-files \
  --agent_executor_type=llm_compiler \
  --model_provider=openai \
  --openai_port=8000 \
  --openai_host=192.168.0.134 \
  --openai_protocol=http \
  --port=9091 \
  --verbose
```

Adjust `--openai_host`, `--openai_port` and `--openai_protocol` to match your own vLLM setup.


Then kick off the benchmark from Python:

In [None]:
if True:
    if not os.path.exists("./output"):
        os.mkdir("./output")
    client = OpenAI(base_url="http://localhost:9091/v1")
    with BenchmarkRunner(openai_client=client, instructions=DEFAULT_JOINER_INSTRUCTIONS_WITH_EXAMPLES, model_name="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", output_file_path="./output/miniassistant_result.json") as benchmark_runner:
        benchmark_runner.run()
        benchmark_runner.get_metrics()
    

INFO:BenchmarkRunner:run started
INFO:httpx:HTTP Request: POST http://localhost:9091/v1/assistants "HTTP/1.1 200 OK"
INFO:BenchmarkRunner:assistant id: asst_760563219457638400
INFO:BenchmarkRunner:item id=5a8b57f25542995d1e6f1371, contained in result? True
INFO:BenchmarkRunner:item id=5a8c7595554299585d9e36b6, contained in result? True
INFO:BenchmarkRunner:item id=5a85ea095542994775f606a8, contained in result? True
INFO:BenchmarkRunner:item id=5adbf0a255429947ff17385a, contained in result? True
INFO:BenchmarkRunner:item id=5a8e3ea95542995a26add48d, contained in result? True
INFO:BenchmarkRunner:item id=5abd94525542992ac4f382d2, contained in result? True
INFO:BenchmarkRunner:item id=5a85b2d95542997b5ce40028, contained in result? True
INFO:BenchmarkRunner:item id=5a87ab905542996e4f3088c1, contained in result? True
INFO:BenchmarkRunner:item id=5a7bbb64554299042af8f7cc, contained in result? True
INFO:BenchmarkRunner:item id=5a8db19d5542994ba4e3dd00, contained in result? True
INFO:Benchmark

In [None]:
if False:
    if not os.path.exists("./output"):
        os.mkdir("./output")
    client = OpenAI()
    with BenchmarkRunner(openai_client=client, instructions=DEFAULT_JOINER_INSTRUCTIONS_WITH_EXAMPLES,  model_name="gpt-3.5-turbo", output_file_path="./output/openai_result.json") as benchmark_runner:
        benchmark_runner.run()
        benchmark_runner.get_metrics()

# Verdicts

## Performance analysis on `LLMCompilerAgentExecutor`

As of version `0.1.3`, `LLMCompilerAgentExecutor` enables parallel function calling with open-source models. On the hard questions of the `hotpotqa` validation split, it achieves an accuracy of around 60%, which is similar to the number reported in the original paper.

After checking the bad cases, I have some concerns, although I ran no further experiments due to my limited energy:

1. `LLMCompiler` requires the LLM to have exceptional reasoning and instruction-following capabilities, at least on par with `gpt-3.5-turbo`, or it may be almost unusable. More often than not, 70B models seem to be a must for such traits. Among sparse MoE models, `mixtral-8x7b` is really the minimum bar. 7B or 13B models simply don't do the trick. In the end, this will limit adoption on local devices.
2. `Replan` seems to be unreliable. The original paper does not discuss the efficiency of re-planning in detail. In my experiments, if the model failed to produce a good plan in the first round, it was unlikely to do better in the second. In fact, in all the benchmark configs of the [official implementation](https://github.com/SqueezeAILab/LLMCompiler/blob/main/configs/hotpotqa/configs.py#L18), `max-replan` is limited to `1`, which disables re-planning in the first place.
3. During dependency resolution, the `joiner` plays an important role: it formats earlier answers into a single-word entity. This simplifies argument substitution for downstream function calls that depend on those results, but it has many limitations.
4. Examples should be given in the instructions if you are working with models weaker than the 70B llama3.
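
To make the third point concrete, here is a minimal sketch of placeholder substitution between dependent calls. It assumes a `$N` / `${N}` placeholder syntax for "output of task N"; the helper `substitute` is hypothetical and is not the code used by `LLMCompiler` or `mini-assistant`.

```python
import re

# Assumed placeholder syntax: "$1" or "${1}" refers to the output of task 1.
PLACEHOLDER = re.compile(r"\$\{?(\d+)\}?")


def substitute(arg: str, outputs: dict[int, str]) -> str:
    """Replace each placeholder in `arg` with the recorded output of that task."""
    return PLACEHOLDER.sub(lambda match: outputs[int(match.group(1))], arg)


# When the earlier answer is already a short entity, the next query stays clean.
print(substitute("$1", {1: "Shirley Temple"}))

# When the earlier answer is a full sentence, the next query becomes that sentence.
print(substitute("$1", {1: "Kiss and Tell is a 1945 American comedy film."}))
```

The second call shows why the earlier answer has to be reduced to an entity first: substitution is purely textual, so whatever the previous step returned becomes the next search query verbatim.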

Again, the points above are not backed by quantitative experiments and are based solely on my personal observations.

The bottom line is that it has better throughput for agent services than the plain `ReAct` strategy. With proper instructions and a reasonably strong model, it can achieve solid performance with function calling.
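
The throughput argument can be illustrated with a toy timing comparison between issuing two independent tool calls one after another and issuing them together. This is a sketch under stated assumptions: `fake_search` is a hypothetical stand-in that only sleeps, and none of this reflects how `mini-assistant` or `LLMCompiler` schedules calls internally.

```python
import asyncio
import time


async def fake_search(query: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for a network-bound tool call."""
    await asyncio.sleep(delay)
    return f"result for {query}"


async def run_sequential(queries):
    # One call at a time: each call waits for the previous one to finish.
    return [await fake_search(query) for query in queries]


async def run_parallel(queries):
    # Independent calls are dispatched together and awaited as a group.
    return list(await asyncio.gather(*(fake_search(query) for query in queries)))


def timed(runner, queries):
    start = time.perf_counter()
    results = asyncio.run(runner(queries))
    return results, time.perf_counter() - start


queries = ["Arthur's Magazine", "First for Women"]
seq_results, seq_seconds = timed(run_sequential, queries)
par_results, par_seconds = timed(run_parallel, queries)
print(f"sequential: {seq_seconds:.2f}s, parallel: {par_seconds:.2f}s")
```

Inside Jupyter an event loop is already running, so `asyncio.run(...)` would need to be replaced with a direct `await` of the coroutine.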


## Comparison analysis with OpenAI

More work should definitely be done with GPT-4 and the even newer GPT-4o, but that will cost me more time and money, so I will look into it later.

For the time being, my priority is to ensure that the open-source implementation of `mini-assistant` keeps pace with commercial offerings.