# Evaluation of the LLM Agent

This notebook will be a prototype and use a model of the agent to test how te evaluation works.
As soon as the code is completed and functional, it will be build into a final version.
Please keep in mind this is an unfinished demo.

### Construction of the LLM KGBOT Agent

In [16]:
# Importing the Libraries

import re
import os
import functools
from typing import List, Tuple, Union

from langchain.agents import Tool, AgentExecutor, LLMSingleActionAgent, AgentOutputParser, create_openai_tools_agent, AgentType, initialize_agent, load_tools
from langchain.prompts import BaseChatPromptTemplate
from langchain.utilities import SerpAPIWrapper
from langchain.chains.llm import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI
from langchain.schema import AgentAction, AgentFinish, HumanMessage
from langchain import hub
from langchain.pydantic_v1 import BaseModel, Field
from langchain.tools import BaseTool, StructuredTool, tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.evaluation import EvaluatorType
from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run
from langchain.smith import arun_on_dataset, run_on_dataset, RunEvalConfig

from RdfGraphCustom import RdfGraph
from smile_resolver import smiles_to_inchikey
from chemical_resolver import ChemicalResolver
from target_resolver import target_name_to_target_id
from taxon_resolver import TaxonResolver
from sparql import GraphSparqlQAChain

In [4]:
# Defining and importing LangSmith
# For now, all runs will be stored in the "KGBot Testing - GPT4"
# If you want to separate the traces to have a better control of specific traces.
# Metadata as llm version and temperature can be obtaneid from traces. 

# Use the code below for generating unique codes if needed
from uuid import uuid4
unique_id = uuid4().hex[0:8]

os.environ["OPENAI_API_KEY"] = ""
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "KGBot Testing - GPT4" #Please update the name here if you want to create a new project for separating the traces. 
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = ""  #This is the API key for HolobiomicsLab

from langsmith import Client
client = Client()

#Check if the client was initialized
print(client)

Client (API URL: https://api.smith.langchain.com)


In [5]:
# Setting up the tools and conditions

# Defining the Endpoint for retrieving the information
endpoint_url = 'https://enpkg.commons-lab.org/graphdb/repositories/ENPKG'
graph = RdfGraph(
    query_endpoint=endpoint_url,
    standard="rdf")

print(graph.get_schema)

#Defining some parameters for the LLM
temperature = 0.3
model_id = "gpt-4" 

# https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.openai.ChatOpenAI.html?highlight=chatopenai#
llm = ChatOpenAI(temperature=temperature, 
                    model=model_id, # default is 'gpt-3.5-turbo'
                    max_retries=3,
                    verbose=True,
                    model_kwargs={
                        "top_p": 0.95,
                        }
                    )

#https://api.python.langchain.com/en/latest/_modules/langchain/chains/graph_qa/sparql.html#
sparql_chain = GraphSparqlQAChain.from_llm(
    llm, graph=graph, verbose=True
)

chem_res = ChemicalResolver.from_llm(llm=llm, verbose=True)
taxon_res = TaxonResolver()

class ChemicalInput(BaseModel):
    query: str = Field(description="natural product compound string")


class SparqlInput(BaseModel):
    question: str = Field(description="the original question from the user")
    entities: str = Field(description="strings containing for all entities, entity name and the corresponding entity identifier")


tools = [
    StructuredTool.from_function(
        name = "CHEMICAL_RESOLVER",
        func = chem_res.run,
        description="The function takes a natural product compound string as input and returns a InChIKey, if InChIKey not found, it returns the NPCClass, NPCPathway or NPCSuperClass.",
        args_schema=ChemicalInput,
    ),
    StructuredTool.from_function(
        name = "TAXON_RESOLVER",
        func=taxon_res.query_wikidata,
        description="The function takes a taxon string as input and returns the wikidata ID.",
    ),
    StructuredTool.from_function(
        name = "TARGET_RESOLVER",
        func=target_name_to_target_id,
        description="The function takes a target string as input and returns the ChEMBLTarget IRI.",
    ),
    StructuredTool.from_function(
        name = "SMILE_CONVERTER",
        func=smiles_to_inchikey,
        description="The function takes a SMILES string as input and returns the InChIKey notation of the molecule.",
    ),
    StructuredTool.from_function(
        name = "SPARQL_QUERY_RUNNER",
        func=sparql_chain.run,
        description="The agent resolve the user's question by querying the knowledge graph database. Input should be a question and the resolved entities in the question.",
        args_schema=SparqlInput,
        # return_direct=True,
    ),
]


tool_names = [tool.name for tool in tools]

list of namespaces []
identifier , N48c944e9717345c1beea82c2f1edc0af
query PREFIX brick: <https://brickschema.org/schema/Brick#>
PREFIX csvw: <http://www.w3.org/ns/csvw#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dcmitype: <http://purl.org/dc/dcmitype/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dcam: <http://purl.org/dc/dcam/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX odrl: <http://www.w3.org/ns/odrl/2/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX prof: <http://www.w3.org/ns/dx/prof/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX schema: <https://schema.org/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX ssn: <http://www.w3.org/ns/ssn/>
PREFIX time: <http://www.w3.org/2006/tim

(<http://www.w3.org/1999/02/22-rdf-syntax-ns#Alt does not look like a valid URI, trying to serialize this will break.
(<http://www.w3.org/1999/02/22-rdf-syntax-ns#Alt>, < does not look like a valid URI, trying to serialize this will break.
(<http://www.w3.org/1999/02/22-rdf-syntax-ns#Alt>, <http://www.w3.org/2000/01/rdf-schema#Container does not look like a valid URI, trying to serialize this will break.
(<http://www.w3.org/1999/02/22-rdf-syntax-ns#Alt>, <http://www.w3.org/2000/01/rdf-schema#Container does not look like a valid URI, trying to serialize this will break.
(<http://www.w3.org/1999/02/22-rdf-syntax-ns#Alt>, <http://www.w3.org/2000/01/rdf-schema#Container>) does not look like a valid URI, trying to serialize this will break.
(<http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag does not look like a valid URI, trying to serialize this will break.
(<http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag>, < does not look like a valid URI, trying to serialize this will break.
(<http://www.w

The namespace prefixes are:
brick: https://brickschema.org/schema/Brick#, csvw: http://www.w3.org/ns/csvw#, dc: http://purl.org/dc/elements/1.1/, dcat: http://www.w3.org/ns/dcat#, dcmitype: http://purl.org/dc/dcmitype/, dcterms: http://purl.org/dc/terms/, dcam: http://purl.org/dc/dcam/, doap: http://usefulinc.com/ns/doap#, foaf: http://xmlns.com/foaf/0.1/, geo: http://www.opengis.net/ont/geosparql#, odrl: http://www.w3.org/ns/odrl/2/, org: http://www.w3.org/ns/org#, prof: http://www.w3.org/ns/dx/prof/, prov: http://www.w3.org/ns/prov#, qb: http://purl.org/linked-data/cube#, schema: https://schema.org/, sh: http://www.w3.org/ns/shacl#, skos: http://www.w3.org/2004/02/skos/core#, sosa: http://www.w3.org/ns/sosa/, ssn: http://www.w3.org/ns/ssn/, time: http://www.w3.org/2006/time#, vann: http://purl.org/vocab/vann/, void: http://rdfs.org/ns/void#, wgs: https://www.w3.org/2003/01/geo/wgs84_pos#, owl: http://www.w3.org/2002/07/owl#, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: htt

In [6]:
# Defining the Prompt for the search

system_message = """You are an entity resolution agent for the SPARQL_QUERY_RUNNER.
You have access to the following tools:
{tool_names}

If the question ask anything about any entities that could be natural product compound, find the relevant IRI to this chemical class using CHEMICAL_RESOLVER. Input is the chemical class name.

If a taxon is mentionned, find what is its wikidata IRI with TAXON_RESOLVER. Input is the taxon name.

If a target is mentionned, find the ChEMBLTarget IRI of the target with TARGET_RESOLVER. Input is the target name.

If a SMILE structure is mentionned, find what is the InChIKey notation of the molecule with SMILE_CONVERTER. Input is the SMILE structure.
        
Give me units relevant to numerical values in this question. Return nothing if units for value is not provided.
Be sure to say that these are the units of the quantities found in the knowledge graph.
Here is the list of units to find:
    "retention time": "minutes",
    "activity value": null, 
    "feature area": "absolute count or intensity", 
    "relative feature area": "normalized area in percentage", 
    "parent mass": "ppm (parts-per-million) for m/z",
    "mass difference": "delta m/z", 
    "cosine": "score from 0 to 1. 1 = identical spectra. 0 = completely different spectra"

Use SPARQL_QUERY_RUNNER tool to answer the question. Input contains the user question + the list of tuples of strings of the resolved entities and units found in previous steps.

If no results tell the user how to improve their question and give the SPARQL query that have been returned by the SPARQL_QUERY_RUNNER.

Give the answer to the user.
        
""".format(tool_names="\n".join(tool_names))

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_message),
        MessagesPlaceholder("chat_history", optional=True),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)

# For debugging
# prompt.pretty_print()

In [7]:
# Creating the Agent and the executor

agent = create_openai_tools_agent(tools=tools, llm=llm, prompt=prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

In [8]:
#Defining some questions

q1 = 'How many features (pos ionization and neg ionization modes) have the same SIRIUS/CSI:FingerID and ISDB annotation by comparing the InCHIKey of the annotations?'
q1_bis = 'How many features (pos ionization and neg ionization modes) have the same SIRIUS/CSI:FingerID and ISDB annotation by comparing the InCHIKey2D of the annotations?'
q2 = 'Which extracts have features (pos ionization mode) annotated as the class, aspidosperma-type alkaloids, by CANOPUS with a probability score above 0.5, ordered by the decresing count of features as aspidosperma-type alkaloids? Group by extract.'
q3 = 'Among the structural annotations from the Tabernaemontana coffeoides (Apocynaceae) seeds extract taxon , which ones contain an aspidospermidine substructure, CCC12CCCN3C1C4(CC3)C(CC2)NC5=CC=CC=C45?'
q4 = 'Among the SIRIUS structural annotations from the Tabernaemontana coffeoides (Apocynaceae) seeds extract taxon, which ones are reported in the Tabernaemontana genus in Wikidata?'
q5 = 'Which compounds have annotations with chembl assay results indicating reported activity against T. cruzi by looking at the cosmic, zodiac and taxo scores?'
q6 = 'Filter the pos ionization mode features of the Melochia umbellata taxon annotated as [M+H]+ by SIRIUS to keep the ones for which a feature in neg ionization mode is detected with the same retention time (+/- 3 seconds) and a mass corresponding to the [M-H]- adduct (+/- 5ppm).'
q7 = 'For features from the Melochia umbellata taxon in pos ionization mode with SIRIUS annotations, get the ones for which a feature in neg ionization mode with the same retention time (+/- 3 seconds) has the same SIRIUS annotation by comparing the InCHIKey 2D. Return the features, retention times, and InChIKey2D'
q8 = "Which features were annotated as 'Tetraketide meroterpenoids' by SIRIUS, and how many such features were found for each species and plant part?"
q9 = "What are all distinct submitted taxons for the extracts in the knowledge graph?"
q10 = "What are the taxons, lab process and label (if one exists) for each sample? Sort by sample and then lab process"
q11 = "Count all the species per family in the collection"
q12 = "Taxons can be found in enpkg:LabExtract. Find the best URI of the Taxon in the context of this question : \n Among the structural annotations from the Tabernaemontana coffeoides (Apocynaceae) seeds extract taxon , which ones contain an aspidospermidine substructure, CCC12CCCN3C1C4(CC3)C(CC2)NC5=CC=CC=C45?"

In [None]:
# Manual testing: substitute the input for any question you like.

# agent_executor.invoke({"input": q1_bis})

### Creating the Dataset for testing

The input questions were obtained from the article mentioned below and the answers were retrieved on the database on the 12th of February of 2024

In [9]:
# Defining the input questions and the expected outputs.

inputs = [
    "How many features (positive and negative ionization modes) have the same SIRIUS/CSI:FingerID and ISDB annotation?",
    #"Which samples have features (positive ionization mode) annotated as aspidosperma-type alkaloids by CANOPUS with a probability score above 0.5, ordered by the decreasing count of features as aspidosperma-type alkaloids?",
    "Among the structural annotations from Tabernaemontana coffeoides (Apocynaceae) seeds extract, which ones contain an aspidospermidine substructure?",
    #"Among the SIRIUS structural annotations from Tabernaemontana coffeoides (Apocynaceae) seeds extract, which ones are reported in the Tabernaemontana genus in WD?",
    #"Which compounds annotated in the active extract of Melochia umbellata have activity against T. cruzi reported (in ChEMBL), and in which taxon they are reported (in Wikidata)?",
    #"Filter the positive ionization mode features of Melochia umbellata annotated as [M+H]+ by SIRIUS to keep the ones for which a feature in negative ionization mode is detected with the same retention time (± 3 seconds) and a mass corresponding to the [M-H]- adduct (± 5 ppm)",
    #"For features from Melochia umbellata in positive mode mode with SIRIUS annotations, get the ones for which a feature in negative ionization mode with the same retention time (± 3 sec) has the same SIRIUS annotation (2D IK)."
]

example_outputs = [
    "33254 features",
    #"32 samples. The sample with the highest count of features annotated as aspidosperma alkaloids (74) is from Tabernaemontana coffeoides (Apocynaceae) seeds extract.",
    #"3 distinct stereochemically undefined structures (2D InChiKey) that contain an aspidospermidine substructure: COC(=O)C1=C2Nc3ccccc3C23CCN2CC4(CC5CC67CC(=O)OC6CCN6CCC8(c9cccc(OC)c9N(C4)C58O)C67)C4OCCC4(C1)C23, COC(=O)C1CC23CCC[N+]4(C2C5(C1(CC3)NC6=CC=CC=C65)CC4)[O-], COC(=O)C1=C2Nc3ccccc3C23CCN2C3C3(CCOC3C3CC4CC56CCOC5CCN5CCC7(c8cccc(OC)c8NC47C32)C56)C1.",
    #"18 distinct stereochemically undefined structures annotated by SIRIUS in Tabernaemontana coffeoides (Apocynaceae) seeds extract and reported in at least one Tabernaemontana sp.",
    #"14 distinct stereochemically undefined structures, all of them reported in Waltheria indica.",
    #"62 features from Melochia umbellata in positive ionization mode annotated as [M+H]+ by SIRIUS with their corresponding potential [M-H]"
    #"22 features in positive ionization mode for which a feature in negative ionization mode with the same retention time has the same annotation."
]

In [10]:
# Creating the datasets for testing

dataset_name = "agent-qa-1st-try"

dataset = client.create_dataset(
    dataset_name,
    description="An example dataset of questions about the knowledge graph",
)

client.create_examples(
    inputs=[{"input": query} for query in inputs],
    outputs=[{"output": answer} for answer in example_outputs],
    dataset_id=dataset.id,
)

### Defining the evalutor

In [12]:
# Since chains can be stateful (e.g. they can have memory), we provide
# a way to initialize a new chain for each row in the dataset. This is done
# by passing in a factory function that returns a new chain for each row.

def create_agent(prompt, llm, tools):
    runnable_agent = (
        {
            "input": lambda x: x["input"],
            "agent_scratchpad": lambda x: format_to_openai_function_messages(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm
        | OpenAIFunctionsAgentOutputParser()
    )
    return AgentExecutor(agent=runnable_agent, tools=tools, verbose=True, handle_parsing_errors=True)

In [13]:
# Defining a custom evaluator that checks if the generated answer is uninformative

@run_evaluator
def check_not_idk(run: Run, example: Example):
    """Illustration of a custom evaluator."""
    agent_response = run.outputs["output"]
    if "don't know" in agent_response or "not sure" in agent_response:
        score = 0
    else:
        score = 1
    # You can access the dataset labels in example.outputs[key]
    # You can also access the model inputs in run.inputs[key]
    return EvaluationResult(
        key="not_uncertain",
        score=score,
    )

In [18]:
# Defining the proper evaluation

evaluation_config = RunEvalConfig(
    # Evaluators can either be an evaluator type (e.g., "qa", "criteria", "embedding_distance", etc.) or a configuration for that evaluator
    evaluators=[
        # Measures whether a QA response is "Correct", based on a reference answer
        # You can also select via the raw string "qa"
        EvaluatorType.QA,
        # Measure the embedding distance between the output and the reference answer
        # Equivalent to: EvalConfig.EmbeddingDistance(embeddings=OpenAIEmbeddings())
        EvaluatorType.EMBEDDING_DISTANCE,
        # Grade whether the output satisfies the stated criteria.
        # You can select a default one such as "helpfulness" or provide your own.
        RunEvalConfig.LabeledCriteria("helpfulness"),
        # The LabeledScoreString evaluator outputs a score on a scale from 1-10.
        # You can use default criteria or write our own rubric
        RunEvalConfig.LabeledScoreString(
            {
                "accuracy": """
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
            },
            normalize_by=10,
        ),
    ],
    # You can add custom StringEvaluator or RunEvaluator objects here as well, which will automatically be
    # applied to each prediction. Check out the docs for examples.
    custom_evaluators=[check_not_idk],
)

In [19]:
chain_results = run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=functools.partial(
        create_agent, prompt=prompt, llm=llm, tools=tools
    ),
    evaluation=evaluation_config,
    verbose=True,
    client=client,
    project_name=f"KGBot_automated_Agent_testing-{unique_id}",
    # Project metadata communicates the experiment parameters,
    # Useful for reviewing the test results
    project_metadata={
        "env": "testing-notebook",
        "model": "gpt-4",
        "prompt": "5d466cbc",
    },
)

View the evaluation results for project 'KGBot_Agent_testing-df539fee' at:
https://smith.langchain.com/o/2830b3a1-2b4b-42fc-bc61-e5012f496ae5/datasets/cae77b57-43ed-4eec-b13d-505007c42e00/compare?selectedSessions=68b9bb27-1863-4781-9c28-aa3e433076d5

View all tests for Dataset agent-qa-1st-try at:
https://smith.langchain.com/o/2830b3a1-2b4b-42fc-bc61-e5012f496ae5/datasets/cae77b57-43ed-4eec-b13d-505007c42e00
[>                                                 ] 0/6

  warn_deprecated(




[1m> Entering new AgentExecutor chain...[0m


[1m> Entering new AgentExecutor chain...[0m


[1m> Entering new AgentExecutor chain...[0m


[1m> Entering new AgentExecutor chain...[0m


[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3mFirst, let's resolve the taxon "Melochia umbellata" using the TAXON_RESOLVER tool.

Next, we need to find the relevant IRI for the chemical class "positive ionization mode features" and "negative ionization mode features" using the CHEMICAL_RESOLVER tool.

The units for "retention time" is "seconds" and for "mass difference" is "ppm (parts-per-million) for m/z".

Now, we can use the SPARQL_QUERY_RUNNER tool to answer the question. The input will be the user question along with the list of tuples of strings of the resolved entities and units found in the previous steps.

If no results are found, the user could improve their question by specifying the exact features they are interested in or by providing more details about the conditions u

Unnamed: 0,feedback.correctness,feedback.embedding_cosine_distance,feedback.helpfulness,feedback.score_string:accuracy,feedback.not_uncertain,error,execution_time,run_id
count,6.0,6.0,6.0,6.0,6.0,0.0,6.0,6
unique,,,,,,0.0,,6
top,,,,,,,,15c73a89-69a7-456d-922d-806b2550b488
freq,,,,,,,,1
mean,0.0,0.196121,0.333333,0.266667,1.0,,39.973089,
std,0.0,0.063734,0.516398,0.196638,0.0,,6.70753,
min,0.0,0.12858,0.0,0.1,1.0,,31.604933,
25%,0.0,0.154622,0.0,0.1,1.0,,34.395407,
50%,0.0,0.172968,0.0,0.2,1.0,,40.837023,
75%,0.0,0.2471,0.75,0.45,1.0,,45.327491,
