<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluating Hugging Face smolagents</h1>

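This notebook shows how to trace a Hugging Face smolagents multi-agent setup with Phoenix, evaluate the relevance of its DuckDuckGo search results using `llm_classify`, and run a Phoenix experiment over a saved dataset.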

## Install Dependencies and Import Libraries

In [None]:
%pip install -q arize-phoenix opentelemetry-sdk opentelemetry-exporter-otlp openinference-instrumentation-smolagents
%pip install smolagents -q
%pip install openai -q

In [2]:
import os 
import openai
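
The evaluation cells later in this notebook call OpenAI through `OpenAIModel` and `openai.OpenAI()`, both of which read the key from the `OPENAI_API_KEY` environment variable. A minimal sketch for setting it interactively when it is missing; `ensure_env` is a hypothetical helper, not part of Phoenix or the OpenAI SDK:

```python
import os
from getpass import getpass


def ensure_env(name: str, prompt=getpass) -> str:
    """Return os.environ[name], prompting for the value once if it is unset."""
    if not os.environ.get(name):
        # getpass hides the typed value so the key does not end up in the cell output
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]


# Usage in the notebook (prompts only when the variable is missing):
# ensure_env("OPENAI_API_KEY")
```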

In [None]:
import phoenix as px
px.launch_app()

In [None]:
from phoenix.otel import register

# configure the Phoenix tracer
myTrace = register(
  project_name="my-agent-app", # Default is 'default'
)

In [4]:
from openinference.instrumentation.smolagents import SmolagentsInstrumentor

SmolagentsInstrumentor().instrument(tracer_provider=myTrace)

In [5]:
from smolagents import (
    CodeAgent,
    ToolCallingAgent,
    ManagedAgent,
    DuckDuckGoSearchTool,
    VisitWebpageTool,
    HfApiModel,
)

model = HfApiModel()

agent = ToolCallingAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=model,
)
managed_agent = ManagedAgent(
    agent=agent,
    name="managed_agent",
    description="This is an agent that can do web search.",
)
manager_agent = CodeAgent(
    tools=[],
    model=model,
    managed_agents=[managed_agent],
)

In [None]:
questions = [
    "Find the top-rated Italian restaurants in Berkeley, CA that are open after 10 PM on weekdays", 
    "What time is it in Tokyo right now?", 
    "What is the definition of quantum computing?", 
    "What is 15% of 240?", 
    "Who wrote Pride and Prejudice, and when was it published?", 
    "When did World War II end?", 
    "How do you write a Python function to calculate the factorial of a number?", 
    "What is the most popular food in the universe?", 
    "What is my favorite color?", 
]
for question in questions:
    # Append a fallback instruction so the agent declines instead of guessing
    prompt = f"{question}\nIf you don't know the answer to this question, please output 'I can not answer that'."
    manager_agent.run(prompt)
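
Agent runs depend on live web search and model calls, so an individual question can fail. A hedged sketch of a loop that records failures instead of stopping the batch; `run_all` and `fake_run` are illustrative names, and in the notebook `run_fn` would be `manager_agent.run`:

```python
from typing import Callable, Dict, List


def run_all(prompts: List[str], run_fn: Callable[[str], object]) -> Dict[str, object]:
    """Run every prompt through run_fn, storing either the answer or the error text."""
    results: Dict[str, object] = {}
    for prompt in prompts:
        try:
            results[prompt] = run_fn(prompt)
        except Exception as exc:  # keep going so one failure does not end the batch
            results[prompt] = f"ERROR: {exc}"
    return results


# Stub runner standing in for manager_agent.run (an assumption for illustration)
def fake_run(prompt: str) -> str:
    if "Tokyo" in prompt:
        raise RuntimeError("search timed out")
    return "I can not answer that"


demo = run_all(["What is 15% of 240?", "What time is it in Tokyo right now?"], fake_run)
print(demo)
```

Each attempt still produces a trace in Phoenix; the dictionary only makes it easier to see which prompts raised.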

## LLM Classify

### Evaluating DuckDuckGo Search Tool

In [24]:
import phoenix as px
from phoenix.trace.dsl import SpanQuery
import json
query = SpanQuery().where(
    "name == 'DuckDuckGoSearchTool'",
).select(
    input="input.value", # this parameter must be named input to work with the RAG_RELEVANCY_PROMPT_TEMPLATE
    reference="output.value", # this parameter must be named reference to work with the RAG_RELEVANCY_PROMPT_TEMPLATE
)

# The Phoenix Client can take this query and return the dataframe.
tool_spans = px.Client().query_spans(query, project_name="my-agent-app")
tool_spans["input"] = tool_spans["input"].apply(lambda x: json.loads(x).get("kwargs", {}).get("query", ""))
tool_spans.head()

| context.span_id | input | reference |
|---|---|---|
| f9903f6640c12d6d | top-rated Italian restaurants in Berkeley, CA ... | ## Search Results\n\n[THE BEST 10 Italian Rest... |
| 82e7f8e730342977 | top-rated Italian restaurants in Berkeley, CA ... | ## Search Results\n\n[THE BEST 10 Italian Rest... |
| 3acbc159c1936edd | top-rated Italian restaurants in Berkeley, CA ... | ## Search Results\n\n[THE BEST 10 Italian Rest... |
| b1636783413982df | opening times of top-rated Italian restaurants... | ## Search Results\n\n[Top 7 italian restaurant... |
| e07133cf52308585 | top-rated Italian restaurants in Berkeley, CA ... | ## Search Results\n\n[THE BEST 10 Italian Rest... |
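
The `apply` call in the query cell assumes each span's `input.value` is a JSON object with the search string under `kwargs.query`. A small sketch of the same extraction with a fallback for malformed payloads; `extract_query` and the sample payload are hypothetical:

```python
import json


def extract_query(raw) -> str:
    """Return kwargs.query from a JSON-encoded tool input, or "" if it is absent."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):  # None, non-strings, or invalid JSON
        return ""
    if not isinstance(payload, dict):
        return ""
    kwargs = payload.get("kwargs")
    if not isinstance(kwargs, dict):
        return ""
    return str(kwargs.get("query", ""))


# Made-up payload in the shape the notebook's lambda expects
sample = json.dumps({"args": [], "kwargs": {"query": "Italian restaurants Berkeley"}})
print(extract_query(sample))
print(repr(extract_query("not json")))
```

It can replace the lambda as `tool_spans["input"].apply(extract_query)`.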


In [17]:
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify
)
import nest_asyncio
nest_asyncio.apply()

print(RAG_RELEVANCY_PROMPT_TEMPLATE)



You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    ************
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.

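`llm_classify` fills the `{input}` and `{reference}` placeholders from the dataframe columns of the same names, producing one prompt per row. A simplified illustration using plain `str.format` on a shortened copy of the template; Phoenix's own templating may differ in detail, and the row values are made up:

```python
# Shortened stand-in for RAG_RELEVANCY_PROMPT_TEMPLATE (illustration only)
SHORT_TEMPLATE = (
    "[Question]: {input}\n"
    "************\n"
    "[Reference text]: {reference}\n"
)

# One hypothetical dataframe row, represented as a dict
row = {
    "input": "What time is it in Tokyo right now?",
    "reference": "## Search Results ... Current local time in Tokyo ...",
}

prompt = SHORT_TEMPLATE.format(**row)
print(prompt)
```

This is why the span query has to name its selected columns `input` and `reference`.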

In [33]:
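eval_model = OpenAIModel(model="gpt-4o")

# Rails restrict the model output to the labels the template asks for
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

eval_results = llm_classify(
    dataframe=tool_spans,
    model=eval_model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=rails,
    concurrency=10,
    provide_explanation=True,
)
# Score from the label, not the explanation: explanations can mention "relevant" either way
eval_results["score"] = eval_results["label"].apply(lambda x: 1 if x == "relevant" else 0)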


In [34]:
eval_results.head()

| context.span_id | label | explanation | exceptions | execution_status | execution_seconds | score |
|---|---|---|---|---|---|---|
| f9903f6640c12d6d | unrelated | The question asks for top-rated Italian restau... | [] | COMPLETED | 5.223566 | 0 |
| 82e7f8e730342977 | unrelated | The question asks for top-rated Italian restau... | [] | COMPLETED | 2.478048 | 0 |
| 3acbc159c1936edd | unrelated | The question asks for top-rated Italian restau... | [] | COMPLETED | 1.952315 | 0 |
| b1636783413982df | unrelated | The question asks for the opening times of top... | [] | COMPLETED | 2.177949 | 0 |
| e07133cf52308585 | unrelated | The question asks for top-rated Italian restau... | [] | COMPLETED | 4.843755 | 0 |
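
Once each span has a label, a numeric score makes the results easy to summarize. A minimal sketch in plain Python, using made-up labels, that maps `relevant` to 1 and everything else to 0 and reports the share of relevant results:

```python
from collections import Counter

# Illustrative labels, not output from this notebook
labels = ["unrelated", "relevant", "unrelated", "relevant", "relevant"]

# Exact comparison on the label avoids substring false positives (e.g. "irrelevant")
scores = [1 if label == "relevant" else 0 for label in labels]
relevance_rate = sum(scores) / len(scores)

print(Counter(labels))
print(f"relevance rate: {relevance_rate:.2f}")
```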


In [35]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(SpanEvaluations(eval_name="DuckDuckGoSearchTool Relevancy", dataframe=eval_results))

## Experiment

In [None]:
# Quote the requirement so the shell does not treat ">=" as output redirection
%pip install "arize-phoenix>=7.8.0"

In [12]:
import os
# Set the phoenix collector endpoint. Commonly http://localhost:6006 
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"

In [13]:
import phoenix as px
# Initialize a phoenix client
client = px.Client()
# Get the current dataset version. You can omit the version for the latest.
dataset = client.get_dataset(name="duckducks", version_id="RGF0YXNldFZlcnNpb246MQ==")

In [21]:
from phoenix.experiments.types import Example
# Define your task
# Typically should be an LLM call or a call to your application
def my_task(example: Example) -> str:
    # Placeholder task: returns a JSON-serializable string built from the example's
    # question and stored search output. Replace with a call to your application.
    return (
        "Test the relevance of the text found for the given input question and "
        "answer with the single word 'relevant' or 'unrelated'. "
        f"Question: {example.input['input']} Text: {example.output['output']}"
    )

In [14]:
evalPrompt = """You are comparing some text pulled from search engines to a question and trying to determine if this text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Reference text]: {reference}
    [END DATA]

Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question. """ 

In [19]:
chatClient = openai.OpenAI()

def correct_json(input, output) -> str:
    # Ask the LLM whether the task output is relevant to the example input
    prompt = evalPrompt.format(question=input, reference=output)
    evals = chatClient.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    # The prompt asks for a single word: "relevant" or "unrelated"
    evaluation_result = evals.choices[0].message.content.strip()
    print(evaluation_result)
    return evaluation_result

# Store the evaluators for later use
evaluators = [correct_json]
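
Models do not always return exactly one lowercase word, so an evaluator may want to normalize the reply before returning it. A hedged sketch; `normalize_label` is a hypothetical helper and the fallback label `unparseable` is an assumption, not a Phoenix convention:

```python
from typing import Iterable


def normalize_label(reply: str, allowed: Iterable[str] = ("relevant", "unrelated")) -> str:
    """Strip whitespace, quotes, and end punctuation, lowercase, then check the label."""
    cleaned = reply.strip().strip("\"'.!").lower()
    return cleaned if cleaned in set(allowed) else "unparseable"


print(normalize_label(' "Relevant". '))
print(normalize_label("I think it is relevant"))
```

Returning a distinct fallback label keeps malformed replies visible in the experiment results instead of silently counting them as one of the two classes.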

In [None]:
from phoenix.experiments import run_experiment

experiment = run_experiment(dataset, my_task, evaluators=evaluators)