# Local Evaluation - Groundedness

After you have set up and configured the prompt flow, it's time to evaluate its performance. Here we use the prompt flow SDK to test different questions and see how the prompt flow performs, using the evaluation prompt flows provided.

In [1]:
from promptflow import PFClient

pf_client = PFClient()

from dotenv import load_dotenv

load_dotenv()



True

In [3]:
# Add a question to test the base prompt flow.
question = "Can you tell me about your jackets?"
customerId = "4"
output = pf_client.test(
    flow="../contoso-chat", # Path to the flow directory
    inputs={ # Inputs to the flow
        "chat_history": [],
        "question": question,
        "customerId": customerId,
    },
)

output["answer"] = "".join(list(output["answer"]))



Prompt flow service has started...
2024-06-04 23:45:41 +0000    7762 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-06-04 23:45:41 +0000    7762 execution.flow     INFO     Start to run 5 nodes with concurrency level 16.
2024-06-04 23:45:41 +0000    7762 execution.flow     INFO     Executing node question_embedding. node run id: c2206879-2ed0-4218-9d7c-ea1b0fedf62d_question_embedding_0
2024-06-04 23:45:41 +0000    7762 execution.flow     INFO     Executing node customer_lookup. node run id: c2206879-2ed0-4218-9d7c-ea1b0fedf62d_customer_lookup_0
2024-06-04 23:45:41 +0000    7762 execution.flow     INFO     Node customer_lookup completes.
2024-06-04 23:45:42 +0000    7762 execution.flow     INFO     Node question_embedding completes.
2024-06-04 23:45:42 +0000    7762 execution.flow     INFO     Executing node retrieve_documentation. node run id: c2206879-2ed0-4218-9d7c-ea1b0fedf62d_retrieve_documentation_0
You can view the trace detail from the following URL:
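
The `"".join(list(output["answer"]))` line above suggests that the flow can return the answer as a sequence of text chunks rather than a single string (for example when streaming is enabled; that is an assumption, not something the output above confirms). A minimal sketch of a helper that accepts either form is shown below. `materialize_answer` and `fake_stream` are hypothetical names used only for illustration.

```python
from collections.abc import Iterable


def materialize_answer(answer):
    """Return the answer as one string, whether it arrived whole or in chunks."""
    if isinstance(answer, str):
        return answer  # already a complete string; nothing to join
    if isinstance(answer, Iterable):
        return "".join(answer)  # consume the generator or list of text chunks
    raise TypeError(f"unsupported answer type: {type(answer).__name__}")


def fake_stream():
    # Hypothetical stand-in for a streamed flow output.
    yield "Of course, "
    yield "Sarah Lee!"


print(materialize_answer(fake_stream()))
print(materialize_answer("already a string"))
```

Joining once and storing the result back into `output["answer"]` matters because a generator can only be consumed a single time; later cells reuse the answer as evaluation input.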

In [4]:
output

{'answer': "Of course, Sarah Lee! 😄 We have two fantastic jackets that would be perfect for your outdoor adventures:\n\n1. Summit Breeze Jacket: 🏔️🌬️ This lightweight jacket is your ultimate companion for hiking. It's windproof, water-resistant, and has adjustable cuffs for your comfort. With its inner lining and reflective accents, you'll feel confident day or night. It's time to reach new heights with the Summit Breeze Jacket! \n\n2. RainGuard Hiking Jacket: ☔⛰️ Don't let the weather stop you! This jacket is waterproof, breathable, and has ventilation zippers for increased airflow. With adjustable features and plenty of pockets, it's perfect for all your outdoor undertakings!\n\nBoth jackets are amazing options, and with your Platinum membership status, you'll get the best deals! 🌟 Let me know if you need any more information or recommendations. Happy exploring! 🚀",
 'context': [{'id': '3',
   'title': 'Summit Breeze Jacket',
   'content': "Discover the joy of hiking with MountainSty

Test the groundedness of the answer to the question above by passing the question, the retrieved context, and the answer to the groundedness evaluation flow.

In [None]:
test = pf_client.test(
    flow="groundedness",
    inputs={
        "question": question,
        "context": str(output["context"]),
        "answer": output["answer"],
    },
)

In [None]:
test

# Local Evaluation - Multiple Metrics 

Now take the same prompt flow output and test it against the multi-evaluation flow (`multi_flow`) for groundedness, coherence, fluency, and relevance.

In [None]:
print("question:", question)
print("context:", output["context"])
print("answer:", output["answer"])
test_multi = pf_client.test(
    "multi_flow",
    inputs={
        "question": question,
        "context": str(output["context"]),
        "answer": output["answer"],
    },
)


In [None]:
test_multi
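
Both local evaluation cells above build the same `inputs` dictionary from the question and the chat flow output. A small helper can keep that mapping in one place. This is only a sketch: `build_eval_inputs` and `sample_output` are hypothetical names, and the sample data is shortened for illustration.

```python
def build_eval_inputs(question, flow_output):
    """Map a chat flow result onto the inputs the evaluation flows above receive."""
    return {
        "question": question,
        # The context is a list of documents; the cells above pass it as a string.
        "context": str(flow_output["context"]),
        "answer": flow_output["answer"],
    }


# Hypothetical, shortened flow output for illustration only.
sample_output = {
    "answer": "We have two jackets.",
    "context": [{"id": "3", "title": "Summit Breeze Jacket"}],
}
inputs = build_eval_inputs("Can you tell me about your jackets?", sample_output)
print(inputs["context"])
```

With the real `output` from the chat flow, the same dictionary could be passed as `inputs=build_eval_inputs(question, output)` to either `pf_client.test` call.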

# Azure AI Studio batch run on an evaluation JSON dataset

To test the flow more thoroughly, we can use Azure AI Studio to run the evaluation prompt flow in batches over a larger dataset.

In [1]:
import json

# Import required libraries
from promptflow.azure import PFClient
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

In [2]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

Populate the `config.json` file with the subscription_id, resource_group, and workspace_name.
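
As a rough sketch, the three fields named above can be written out as JSON like this. The values are placeholders, and `build_workspace_config` is a hypothetical helper, not part of any SDK.

```python
import json


def build_workspace_config(subscription_id, resource_group, workspace_name):
    """Return the JSON text for config.json with the three fields named above."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    return json.dumps(config, indent=4)


# Placeholder values; replace them with your own before saving to config.json.
config_text = build_workspace_config(
    "<your-subscription-id>", "<your-resource-group>", "<your-workspace-name>"
)
print(config_text)
```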

Set `runtime` to the name of the AI Studio runtime that will be used for the cloud batch runs.

In [4]:
# Update the runtime to the name of the runtime you created previously
runtime = "automatic"
# load flow
flow = "../contoso-chat"
flow_hf = "../contoso-chat_hf"
# load data
data = "../data/salestestdata.jsonl"
config_path = "../config.json"
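
The data file is in JSONL format (one JSON object per line), and the batch runs in this notebook read the `customerId` and `question` columns from it. Before submitting a cloud run, it can help to check the file locally. The sketch below assumes only those two columns are required; `find_invalid_lines` and the sample rows are hypothetical.

```python
import json

REQUIRED_COLUMNS = ("customerId", "question")  # columns the batch runs read


def find_invalid_lines(jsonl_text, required=REQUIRED_COLUMNS):
    """Return (line_number, problem) pairs for lines that would break the run."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [name for name in required if name not in record]
        if missing:
            problems.append((number, "missing " + ", ".join(missing)))
    return problems


# Hypothetical sample in the shape the batch run expects.
sample = "\n".join([
    '{"customerId": "4", "question": "tell me about your hiking jackets"}',
    '{"question": "Do you have any climbing gear?"}',
])
problems = find_invalid_lines(sample)
print(problems)
```

For the real file, the text could be read with `open(data, encoding="utf-8").read()` and passed to the same function.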

In [5]:
# get current time stamp for run name
import datetime
now = datetime.datetime.now()
timestamp = now.strftime("%Y_%m_%d_%H%M%S")
run_name = timestamp+"chat_base_run"
print(run_name)

pf_azure_client = PFClient.from_config(credential=credential, path=config_path)

2024_06_04_142628chat_base_run
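
The run name above is a timestamp followed by a fixed suffix. Since the same pattern is reused later for the evaluation run, it can be wrapped in a small function. `make_run_name` is a hypothetical helper; passing an explicit `now` makes the result reproducible.

```python
import datetime


def make_run_name(suffix, now=None):
    """Build a run name as <timestamp><suffix>, matching the format used above."""
    now = now or datetime.datetime.now()
    return now.strftime("%Y_%m_%d_%H%M%S") + suffix


fixed = datetime.datetime(2024, 6, 4, 14, 26, 28)
print(make_run_name("chat_base_run", fixed))  # 2024_06_04_142628chat_base_run
```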


Create a base run to use as the variant for the evaluation runs. 

_NOTE: If you get the error "An existing connection was forcibly closed by the remote host", run the cell again._
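
Instead of re-running the cell by hand, the submission could be wrapped in a small retry loop. The sketch below is an assumption-heavy illustration: it retries on Python's built-in `ConnectionError` family, and the exception type the SDK actually raises may differ. `run_with_retry` and `flaky_submit` are hypothetical names.

```python
import time


def run_with_retry(submit, attempts=3, delay_seconds=5.0, retry_on=(ConnectionError,)):
    """Call submit() and retry on the given exceptions, re-raising after the last try."""
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except retry_on as error:
            if attempt == attempts:
                raise  # out of attempts; surface the original error
            print(f"attempt {attempt} failed ({error}); retrying")
            time.sleep(delay_seconds)


calls = {"count": 0}


def flaky_submit():
    # Hypothetical stand-in that fails once, then succeeds.
    calls["count"] += 1
    if calls["count"] < 2:
        raise ConnectionResetError("connection closed by remote host")
    return "submitted"


result = run_with_retry(flaky_submit, delay_seconds=0)
print(result, calls["count"])
```

One caveat with this design: if a failed attempt did create the run on the service side, retrying with the same `name` may conflict, so generating a fresh run name per attempt may be safer.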

In [6]:
# create base run in Azure AI Studio
base_run = pf_azure_client.run(
    flow=flow,
    data=data,
    column_mapping={
        # reference data
        "customerId": "${data.customerId}",
        "question": "${data.question}",
    },
    runtime=runtime,
    # create a display name as current datetime
    display_name=run_name,
    name=run_name
)
print(base_run)



Portal url: https://ai.azure.com/projectflows/trace/run/2024_06_04_142628chat_base_run/details?wsid=/subscriptions/28e0cd19-9f05-4b6c-bac8-3bb37e8eeee3/resourcegroups/rg-deployment6/providers/Microsoft.MachineLearningServices/workspaces/contoso-chat-sf-aiproj
name: 2024_06_04_142628chat_base_run
created_on: '2024-06-04T14:26:48.142237+00:00'
status: Preparing
display_name: 2024_06_04_142628chat_base_run
description:
tags: {}
properties:
  azureml.promptflow.inputs_mapping: '{"customerId":"${data.customerId}","question":"${data.question}"}'
  azureml.promptflow.runtime_name: automatic
  azureml.promptflow.disable_trace: 'false'
  azureml.promptflow.session_id: 3ee7611837014426b3f88afef8d079a99d16385b9b37e347
  azureml.promptflow.definition_file_name: flow.dag.yaml
  azureml.promptflow.flow_lineage_id: 23ccbb0c520a4d89658b65dfca05e64d37bb66f7118de0ef2e9c2fce1513d985
  azureml.promptflow.flow_definition_datastore_name: workspaceblobstore
  azureml.promptflow.flow_definition_blob_path: Loc

In [None]:
pf_azure_client.stream(base_run)

In [16]:
details = pf_azure_client.get_details(base_run)
details.head(10)

| outputs.line_number | inputs.customerId | inputs.question | inputs.line_number | inputs.chat_history | outputs.answer | outputs.context |
|---|---|---|---|---|---|---|
| 0 | 4 | tell me about your hiking jackets | 0 | [] | Hey Sarah Lee! 🌟 Let me tell you about our awe... | [{'id': '17', 'title': 'RainGuard Hiking Jacke... |
| 1 | 1 | Do you have any climbing gear? | 1 | [] | Hey John! 👋 Absolutely, we have some great cli... | [{'id': '9', 'title': 'SummitClimber Backpack'... |
| 2 | 3 | Can you tell me about your selection of tents? | 2 | [] | Hey Michael! 👋 We have a great selection of te... | [{'id': '15', 'title': 'SkyView 2-Person Tent'... |
| 3 | 6 | Do you have any hiking boots? | 3 | [] | Hey Emily Rodriguez! 👋 Absolutely, we have som... | [{'id': '4', 'title': 'TrekReady Hiking Boots'... |
| 4 | 2 | What gear do you recommend for hiking? | 4 | [] | Hey Jane! 🌲🏕️ For hiking, I recommend the Trai... | [{'id': '10', 'title': 'TrailBlaze Hiking Pant... |


In [30]:
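# NOTE: pf_azure_client_hf and base_run_hf are not created in the cells shown above;
# this cell assumes a separate client and base run exist for flow_hf.
details = pf_azure_client_hf.get_details(base_run_hf)
details.head(10)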

| outputs.line_number | inputs.customerId | inputs.question | inputs.line_number | inputs.chat_history |
|---|---|---|---|---|
| 0 | 4 | tell me about your hiking jackets | 0 | [] |
| 1 | 1 | Do you have any climbing gear? | 1 | [] |
| 2 | 3 | Can you tell me about your selection of tents? | 2 | [] |
| 3 | 6 | Do you have any hiking boots? | 3 | [] |
| 4 | 2 | What gear do you recommend for hiking? | 4 | [] |


# Cloud Eval run on JSON Data

In [None]:
# path to the evaluation flow
eval_flow = "multi_flow/"

data = "../data/salestestdata.jsonl"
run_name = timestamp + "chat_eval_run"
print(run_name)

eval_run_variant = pf_azure_client.run(
    flow=eval_flow,
    data=data,  # path to the data file
    run=base_run,  # use run as the variant
    column_mapping={
        # reference data
        "customerId": "${data.customerId}",
        "question": "${data.question}",
        "context": "${run.outputs.context}",
        # reference the run's output
        "answer": "${run.outputs.answer}",
    },
    runtime=runtime,
    display_name=run_name,
    name=run_name
)
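
In the `column_mapping` above, `${data.<column>}` refers to a column of the input data file and `${run.outputs.<name>}` refers to an output of the run passed as `run=`. The sketch below illustrates how one line of input could be assembled from those two sources. It is a simplified illustration, not promptflow's actual implementation; `resolve_mapping` and the sample values are hypothetical.

```python
import re

# Matches "${data.<column>}" or "${run.outputs.<name>}".
PLACEHOLDER = re.compile(r"^\$\{(data|run\.outputs)\.(\w+)\}$")


def resolve_mapping(column_mapping, data_row, run_outputs):
    """Fill each mapped input from the data row or from the base run's outputs."""
    resolved = {}
    for input_name, template in column_mapping.items():
        match = PLACEHOLDER.match(template)
        if match is None:
            resolved[input_name] = template  # treat anything else as a literal
            continue
        source, column = match.groups()
        values = data_row if source == "data" else run_outputs
        resolved[input_name] = values[column]
    return resolved


column_mapping = {
    "question": "${data.question}",
    "context": "${run.outputs.context}",
    "answer": "${run.outputs.answer}",
}
# Hypothetical single line of data and the matching base run output.
data_row = {"customerId": "4", "question": "tell me about your hiking jackets"}
run_outputs = {"answer": "Hey Sarah Lee!", "context": "[{'id': '17'}]"}
resolved = resolve_mapping(column_mapping, data_row, run_outputs)
print(resolved)
```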


In [None]:
pf_azure_client.stream(eval_run_variant)

In [None]:
details = pf_azure_client.get_details(eval_run_variant)
details.head(10)

In [None]:

metrics = pf_azure_client.get_metrics(eval_run_variant)
print(json.dumps(metrics, indent=4))

In [None]:
pf_azure_client.visualize([base_run, eval_run_variant])