# Evaluate with various inputs

## Objective

This notebook shows how to use jsonl and csv files as inputs for evaluation, and how to supply both query/response and conversation-based inputs within those files.

Note: When this notebook refers to 'conversations', it means the conversation type defined [here](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/azure.ai.evaluation.conversation?view=azure-python#attributes). This is a simplified variant of the broader Chat Protocol standard that is defined [here](https://github.com/microsoft/ai-chat-protocol).

## Time

You should expect to spend about 10 minutes running this notebook.

## Setup


In [None]:
# Install the Evaluation SDK package
%pip install azure-ai-evaluation

### Imports
Run this cell to import everything that is needed for this sample.

In [None]:
from azure.ai.evaluation import evaluate
from typing import List, Tuple, Dict, Optional, TypedDict
from pathlib import Path

## Evaluator definition

We define a toy math evaluator below to showcase multi-input handling. Several built-in evaluators, such as the `ContentSafetyEvaluator` and the `ProtectedMaterialEvaluator`, have an input structure similar to the evaluator below. However, they all require API connections to function. To avoid that setup and keep this sample offline-capable, this toy evaluator requires no external support.

In [None]:
# Underlying evaluation: return the ratio of the query length to the response length
def query_response_ratio(query: str, response: str) -> float:
    return len(query) / len(response)


# Helper function that converts a conversation into a list of query-response pairs
def unwrap_conversation(conversation: Dict) -> List[Tuple[str, str]]:
    queries = []
    responses = []
    for turn in conversation["messages"]:
        if turn["role"] == "user":
            queries.append(turn["content"])
        else:
            responses.append(turn["content"])
    # Materialize the pairs so the return value matches the annotated List type.
    return list(zip(queries, responses))


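# Define the output of the evaluation to make the sample repo's robust type requirements happy.
# total=False makes every key optional; per_turn_results is only set for conversation inputs.
class EvalOutput(TypedDict, total=False):
    result: float
    per_turn_results: List[float]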


# Actual evaluation function, which handles either a single query-response pair or a conversation
def simple_evaluator_function(
    query: Optional[str] = None, response: Optional[str] = None, conversation: Optional[Dict] = None
) -> EvalOutput:
    if conversation is not None and query is None and response is None:
        per_turn_results = [query_response_ratio(q, r) for q, r in unwrap_conversation(conversation)]
        return {"result": sum(per_turn_results) / len(per_turn_results), "per_turn_results": per_turn_results}
    if conversation is None and query is not None and response is not None:
        return {"result": query_response_ratio(query, response)}
    raise ValueError("Either a conversation or a query-response pair must be provided.")


# Feel free to replace this assignment with more complex evaluation functions for further testing.
my_evaluator = simple_evaluator_function
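
Because the helper pairs messages purely by position, it is worth seeing how it behaves on conversations that are not strictly alternating. The sketch below re-declares a copy of the pairing logic under the hypothetical name `pair_turns` (so it does not shadow the notebook's helper) and assumes the same `messages` / `role` / `content` layout used in this sample. It suggests that any non-user role, including a system message, is collected as a response, and that `zip` silently drops unmatched trailing messages.

```python
from typing import Dict, List, Tuple


def pair_turns(conversation: Dict) -> List[Tuple[str, str]]:
    # Hypothetical copy of the position-based pairing used by the helper above.
    queries = [m["content"] for m in conversation["messages"] if m["role"] == "user"]
    responses = [m["content"] for m in conversation["messages"] if m["role"] != "user"]
    return list(zip(queries, responses))


# Strictly alternating conversation: pairs line up as expected.
balanced = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello there"},
    ]
}

# A leading system message is counted as the first "response".
with_system = {
    "messages": [
        {"role": "system", "content": "Be brief"},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello there"},
    ]
}

balanced_pairs = pair_turns(balanced)
system_pairs = pair_turns(with_system)
print(balanced_pairs)  # [('Hi', 'Hello there')]
print(system_pairs)  # [('Hi', 'Be brief')]
```

If your data contains system messages, filtering them out before pairing is one way to keep queries and responses aligned.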

With the evaluator defined above, we can input either a query and response together, or a conversation to receive a result:

In [None]:
# Query+response evaluation
qr_result = my_evaluator(query="Hello", response="world")
print(f"query/response output: {qr_result}")

conversation_input = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "world"},
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "world and more words to change ratio"},
    ]
}

# Conversation evaluation
conversation_result = my_evaluator(conversation=conversation_input)
print(f"conversation output: {conversation_result}")
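
To make the averaging concrete, the arithmetic for the conversation above can be checked by hand. This sketch uses only `len` and the strings from the cell above; the variable names are illustrative and not part of the sample.

```python
# Strings taken from the conversation example above.
turns = [
    ("Hello", "world"),
    ("Hello", "world and more words to change ratio"),
]

# Per-turn ratio: query length divided by response length.
per_turn = [len(q) / len(r) for q, r in turns]

# The top-level result is the mean of the per-turn ratios.
average = sum(per_turn) / len(per_turn)

print(per_turn)  # first turn is 5 / 5, second turn is 5 / 36
print(round(average, 4))  # 0.5694
```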

## Datasets

Passing inputs directly to an evaluator, as shown above, is useful for sanity checks. For larger datasets, we typically pass the evaluator and a dataset file to the `evaluate` method instead. For that, we will need some data files.

Included in this sample directory are 3 files:
- qr_data.jsonl contains query/response inputs in jsonl format.
- qr_data.csv contains query/response inputs in csv format.
- conversation_data.jsonl contains conversation inputs in jsonl format.

Conversations and other complex inputs are not supported via csv inputs, so there is no corresponding "conversation_data.csv" file. Each file contains the same three query/response pairs, but in the conversation dataset, the second and third pairs are wrapped into a single, 4-turn conversation.
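
As a rough illustration of the two row shapes, the sketch below builds one query/response row and one conversation row and serializes them. The text content is made up and is not the content of the sample files. The `messages` / `role` / `content` layout follows the conversation example earlier in this notebook, and the column names are assumed to match the evaluator's `query`, `response`, and `conversation` parameters.

```python
import csv
import io
import json

# Hypothetical rows; the real sample files contain different text.
qr_row = {"query": "What is 2 + 2?", "response": "4"}
conversation_row = {
    "conversation": {
        "messages": [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "4"},
        ]
    }
}

# jsonl: one JSON object per line, so nested structures survive a round trip.
qr_line = json.dumps(qr_row)
conversation_line = json.dumps(conversation_row)

# csv: flat columns of strings, which suits query/response pairs.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["query", "response"])
writer.writeheader()
writer.writerow(qr_row)
csv_text = buffer.getvalue()

print(qr_line)
print(conversation_line)
print(csv_text)
```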

Double-check the contents of these files by changing the variable referenced in the `with` statement below. You might need to alter the `path_to_data` value depending on where your notebook is running:

In [None]:
# Change this depending on where your notebook is running.
# Default value assumes that the notebook is running in the root of the repository.
path_to_data = "./scenarios/evaluate/evaluate_with_various_inputs"
# Define data path variables.
qr_js_data = path_to_data + "/qr_data.jsonl"
qr_csv_data = path_to_data + "/qr_data.csv"
conversation_js_data = path_to_data + "/conversation_data.jsonl"

# Change variable referenced here to check different files
with Path(qr_js_data).open() as f:
    print(f.read())

## Evaluation

Now that we have some datasets and an evaluator, we can pass both of them into `evaluate`. Starting with query/response jsonl inputs:

In [None]:
js_qr_output = evaluate(
    data=qr_js_data,
    evaluators={"test": my_evaluator},
    _use_pf_client=False,  # Avoid using PF dependencies to further simplify the example
)

eval_row_results = [row["outputs.test.result"] for row in js_qr_output["rows"]]
metrics = js_qr_output["metrics"]

print(f"query/response jsonl results: {eval_row_results} \nwith overall metrics: {metrics}")
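
The column name `outputs.test.result` used in the cell above combines the evaluator alias (`test`) with a key of the dict the evaluator returns (`result`). The sketch below imitates that shape with a hand-built dict so the lookup pattern can be tried without running `evaluate`. The mock values are illustrative, and `collect_column` is a hypothetical helper, not part of the SDK.

```python
from typing import Any, Dict, List

# Hand-built stand-in for the dict returned by `evaluate` (values are made up).
mock_output: Dict[str, Any] = {
    "rows": [
        {"outputs.test.result": 0.4},
        {"outputs.test.result": 1.0},
    ],
}


def collect_column(output: Dict[str, Any], alias: str, key: str) -> List[Any]:
    # Build the "outputs.<alias>.<key>" column name and read it from every row.
    column = f"outputs.{alias}.{key}"
    return [row[column] for row in output["rows"]]


results = collect_column(mock_output, alias="test", key="result")
print(results)  # [0.4, 1.0]
```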

Now let's run the evaluation using the conversation-based jsonl data. Notice that the evaluator works both for conversations that convert into a single query-response pair and for conversations that convert into multiple query-response pairs. It also produces an extra output called `per_turn_results`, which lets you check the result of each query-response evaluation within a conversation, since the top-level result is the average of these values. Built-in evaluators also produce this `per_turn_results` value when evaluating conversations.

In [None]:
js_convo_output = evaluate(
    data=conversation_js_data,
    evaluators={"test": my_evaluator},
    _use_pf_client=False,
)

eval_row_results = [row["outputs.test.result"] for row in js_convo_output["rows"]]
per_turn_results = [row["outputs.test.per_turn_results"] for row in js_convo_output["rows"]]
metrics = js_convo_output["metrics"]

print(
    f"""conversation jsonl results: {eval_row_results} 
with per turn results: {per_turn_results} 
and overall metrics: {metrics}"""
)

Next we run the evaluation using the csv file as input. As expected, the results are the same as the equivalent jsonl file:

In [None]:
csv_qr_output = evaluate(
    data=qr_csv_data,
    evaluators={"test": my_evaluator},
    _use_pf_client=False,
)

eval_row_results = [row["outputs.test.result"] for row in csv_qr_output["rows"]]
metrics = csv_qr_output["metrics"]

print(f"Query/response csv results: {eval_row_results} \nwith overall metrics: {metrics}")
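
The equivalence between the two formats can also be checked without the SDK by parsing the same pairs from each format and applying the ratio directly. This sketch uses in-memory strings with made-up pairs in place of the sample files, and assumes `query` and `response` as the column names.

```python
import csv
import io
import json

# Made-up data standing in for qr_data.jsonl and qr_data.csv.
jsonl_text = '{"query": "Hi", "response": "Hello"}\n{"query": "Bye", "response": "Bye"}\n'
csv_text = "query,response\nHi,Hello\nBye,Bye\n"


def ratio(query: str, response: str) -> float:
    # Same toy metric as the evaluator in this notebook.
    return len(query) / len(response)


# Parse one JSON object per non-empty line, and one dict per csv data row.
jsonl_rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

jsonl_results = [ratio(r["query"], r["response"]) for r in jsonl_rows]
csv_results = [ratio(r["query"], r["response"]) for r in csv_rows]

print(jsonl_results == csv_results)  # True
```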

## Conclusion

This sample has shown various ways to input data using `evaluate`, and the difference between query/response and conversation-based inputs. As the SDK improves, more of the built-in evaluators will support a larger variety of input schemes. We encourage users to use whichever options suit their needs.