# Use `AutoRater` to Evaluate Answer Completeness and Accuracy for Given Questions

In this example, we show how to use AutoRater to check whether an answer is correct for a given question and context.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
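A missing key usually only surfaces as an error on the first model call, so it can help to check for it up front. The snippet below is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper, not part of `uniflow`, and it assumes the `.env` file has already been loaded into the process environment (for example by `load_dotenv()`).

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise a readable error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to the .env file."
        )
    return key
```

Calling `require_api_key()` right after `load_dotenv()` would fail early with a clear message instead of partway through a run.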

### Import the dependency
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import RaterClassificationConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt_schema import Context
from uniflow.op.op import OpScope

load_dotenv()


True

### Prepare the input data

We use three example raw inputs. Each one is a tuple consisting of context, question, and answer to be labeled. The ground truth label of the first one is 'correct', and the others are 'incorrect'. Then, we use the `Context` class to wrap them.

In [2]:
raw_input = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean in the south.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."), # correct
    ("Shakespeare, a renowned English playwright and poet, wrote 39 plays during his lifetime. His works include famous plays like 'Hamlet' and 'Romeo and Juliet'.",
     "How many plays did Shakespeare write?",
     "Shakespeare wrote 31 plays."), # incorrect
    ("The human brain is an intricate organ responsible for intelligence, memory, and emotions. It is made up of approximately 86 billion neurons.",
     "What is the human brain responsible for?",
     "The human brain is responsible for physical movement."), # incorrect
]

data = [
    Context(context=c[0], question=c[1], answer=c[2])
    for c in raw_input
]

## Set up the config: JSON format

In this example, we will use the [`OpenAIModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) as the default LLM to evaluate the answers. If you want to use open-source models, you can replace `OpenAIModelConfig` with [`HuggingfaceModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L27).

We use the default `guided_prompt` in `RaterClassificationConfig`, which includes two examples, labeled 'Yes' and 'No'. The default examples are also encapsulated within the `Context` class, which has fields for context, question, answer (and label), aligning with the input data format.

The response format is JSON, enabling the model to return a JSON object as output rather than plain text. This facilitates more convenient processing.

In [3]:
config = RaterClassificationConfig(
    flow_name="RaterFlow",
    model_config=OpenAIModelConfig(num_call=3, response_format={"type": "json_object"}),
    label2score={"Yes": 1.0, "No": 0.0})

with OpScope(name="JSONFlow"):
    client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-3.5-turbo-1106', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'json_object'}}, label2score={'Yes': 1.0, 'No': 0.0}, guided_prompt_template=GuidedPrompt(instruction='\n        Task: Answer Evaluation\n        Objective:\n        You are required to evaluate whether a given answer is appropriate in relation to a specific context and question. \n        Input:\n        1. Context: This is a brief text, usually a couple of sentences or a paragraph, providing key information or facts.\n        2. Question: This is a query related to the information given in the context. It is designed to test knowledge that can be inferred or directly obtained from the context.\n        3. Answer: This is a response to the question provided.\n        Evaluation Criteria:\n        Based on these, you need to judge if the answer is correct or incorrect in relation to the context and the q

### Run the client

Then we can run the client. For each item in `data`, the client generates an explanation and a final label, either `Yes` or `No`. The label is determined by taking the majority vote over three samples of the LLM's output, which improves stability and self-consistency compared to generating a single output.

In [4]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:14<00:00,  4.84s/it]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'yes',
              'response': [{'answer': 'The largest ocean on Earth is the '
                                      'Pacific Ocean.',
                            'context': 'The Pacific Ocean is the largest and '
                                       "deepest of Earth's oceanic divisions. "
                                       'It extends from the Arctic Ocean in '
                                       'the north to the Southern Ocean in the '
                                       'south.',
                            'explanation': 'The answer correctly identifies '
                                           'the Pacific Ocean as the largest '
                                           'ocean on Earth, which is '
                                           'consistent with the information '
                                           'provided in the context. '
              




We can see that each model response is a JSON object.

In [5]:
pprint.pprint(output[0]["output"][0]["response"][0])

{'answer': 'The largest ocean on Earth is the Pacific Ocean.',
 'context': "The Pacific Ocean is the largest and deepest of Earth's oceanic "
            'divisions. It extends from the Arctic Ocean in the north to the '
            'Southern Ocean in the south.',
 'explanation': 'The answer correctly identifies the Pacific Ocean as the '
                'largest ocean on Earth, which is consistent with the '
                'information provided in the context. Therefore, the answer is '
                'good.',
 'label': 'Yes',
 'question': 'What is the largest ocean on Earth?'}
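To illustrate why a JSON reply is convenient to process, here is a minimal sketch of turning a raw JSON string into a label and a score. The `raw_reply` string and the `parse_reply` helper are hypothetical and written for this example; this is not how `uniflow` parses responses internally.

```python
import json

# Hypothetical raw model reply in JSON mode (not captured from a real run).
raw_reply = (
    '{"explanation": "The context states the Pacific is the largest ocean.", '
    '"label": "Yes"}'
)

label2score = {"Yes": 1.0, "No": 0.0}  # same mapping as in the config above


def parse_reply(text, label2score):
    """Parse a JSON reply and map its label to a score.

    Returns (None, None) if the text is not a JSON object with a string
    label; returns (label, None) if the label is not in label2score.
    """
    try:
        reply = json.loads(text)
    except json.JSONDecodeError:
        return None, None
    if not isinstance(reply, dict):
        return None, None
    label = reply.get("label")
    if not isinstance(label, str):
        return None, None
    return label, label2score.get(label)


label, score = parse_reply(raw_reply, label2score)
print(label, score)  # Yes 1.0
```

With plain-text replies the same information has to be recovered by pattern matching, which is more fragile.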


The model's responses can be distilled into a majority vote and an average score per item, as shown below. Because LLM output is non-deterministic (each inference can yield a different result), sampling three outputs and aggregating them gives more stable, self-consistent labels than relying on a single output.

In [6]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has majority vote yes and average score 1.0
data 1 has majority vote no and average score 0.0
data 2 has majority vote no and average score 0.0
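The aggregation described above can be sketched in a few lines. This is an illustrative reimplementation under simple assumptions (labels are lowercased for voting, as in the output shown earlier, and ties go to the label seen first); `aggregate` is a hypothetical helper, and the logic inside `uniflow` may differ.

```python
from collections import Counter


def aggregate(labels, label2score):
    """Return (majority_vote, average_score) for a list of sampled labels."""
    votes = [label.lower() for label in labels]
    scores = [label2score[label] for label in labels]
    # most_common(1) returns the most frequent vote; ties keep first-seen order.
    majority_vote = Counter(votes).most_common(1)[0][0]
    average_score = sum(scores) / len(scores)
    return majority_vote, average_score


label2score = {"Yes": 1.0, "No": 0.0}

# Two of three samples say "Yes": the vote is 'yes' and the score is 2/3.
print(aggregate(["Yes", "Yes", "No"], label2score))
```

The average score keeps information that the vote discards: a 2-to-1 split and a unanimous result produce the same majority vote but different scores.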


## Set up the config: Text format

Keeping the previous settings, we change the `response_format` passed to `OpenAIModelConfig` to `{"type": "text"}`, so the model outputs plain text instead of a JSON object. In this case, AutoRater uses a regex to match the label.

In [7]:
config = RaterClassificationConfig(
    flow_name="RaterFlow",
    model_config=OpenAIModelConfig(num_call=3, response_format={"type": "text"}),
    label2score={"Yes": 1.0, "No": 0.0})

with OpScope(name="TextFlow"):
    client = RaterClient(config)

output = client.run(data)

pprint.pprint(output)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-3.5-turbo-1106', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'text'}}, label2score={'Yes': 1.0, 'No': 0.0}, guided_prompt_template=GuidedPrompt(instruction='\n        Task: Answer Evaluation\n        Objective:\n        You are required to evaluate whether a given answer is appropriate in relation to a specific context and question. \n        Input:\n        1. Context: This is a brief text, usually a couple of sentences or a paragraph, providing key information or facts.\n        2. Question: This is a query related to the information given in the context. It is designed to test knowledge that can be inferred or directly obtained from the context.\n        3. Answer: This is a response to the question provided.\n        Evaluation Criteria:\n        Based on these, you need to judge if the answer is correct or incorrect in relation to the context and the question

100%|██████████| 3/3 [00:07<00:00,  2.37s/it]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'yes',
              'response': ['explanation: The context states that the Pacific '
                           "Ocean is the largest and deepest of Earth's "
                           'oceanic divisions, so the answer is correct.\n'
                           'label: Yes',
                           'explanation: The context explicitly states that '
                           'the Pacific Ocean is the largest ocean on Earth, '
                           'so the answer is correct.\n'
                           'label: Yes',
                           'explanation: The context explicitly states that '
                           'the Pacific Ocean is the largest ocean on Earth, '
                           'so the answer is correct.\n'
                           'label: Yes'],
              'scores': [1.0, 1.0, 1.0],
              'votes': ['yes', 'yes', 'yes']}],
  'root': <uniflow.




As in the JSON case, the model's responses can be distilled into a majority vote and an average score per item, aggregated over three samples of the LLM's output.

In [8]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has majority vote yes and average score 1.0
data 1 has majority vote no and average score 0.0
data 2 has majority vote no and average score 0.0
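The text-format responses shown above each end with a `label:` line, which is what makes regex matching possible. The sketch below shows one way such a label could be extracted; the pattern and the `extract_label` helper are assumptions made for illustration, and the regex AutoRater actually uses may differ.

```python
import re

# Illustrative pattern only; uniflow's actual regex may differ.
LABEL_PATTERN = re.compile(r"label:\s*(yes|no)\b", re.IGNORECASE)


def extract_label(text):
    """Return the last 'label: ...' value in lowercase, or None if absent."""
    matches = LABEL_PATTERN.findall(text)
    return matches[-1].lower() if matches else None


reply = (
    "explanation: The context states that the Pacific Ocean is the largest "
    "and deepest of Earth's oceanic divisions, so the answer is correct.\n"
    "label: Yes"
)
print(extract_label(reply))  # yes
```

Taking the last match is a design choice: if the explanation itself happens to contain the phrase `label: no`, the final verdict line still wins.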
