# Use `AutoRater` to Evaluate Answer Completeness and Accuracy for Given Questions

In this example, we show how to use `AutoRater` to verify the correctness of an answer, given a question and its context.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
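
As a quick sanity check before running the notebook, the sketch below tests whether `OPENAI_API_KEY` is visible to the Python process. The helper name `api_key_is_set` is hypothetical and not part of `uniflow`. It only inspects a mapping such as `os.environ`, so it assumes the `.env` file has already been loaded (for example by `load_dotenv()`, as done in the import cell of this notebook).

```python
import os


def api_key_is_set(env=os.environ):
    """Return True if OPENAI_API_KEY is present and non-empty in `env`.

    `env` is any mapping of variable names to values; it defaults to the
    environment of the current process.
    """
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not api_key_is_set():
    print("OPENAI_API_KEY is not set; check the .env file in the repository root.")
```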

### Import the dependency
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import RaterClassificationConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt_schema import Context
from uniflow.op.op import OpScope

load_dotenv()


True

### Prepare the input data

We use three example raw inputs. Each one is a tuple of a context, a question, and an answer to be labeled. The ground-truth label of the first one is correct, and the other two are incorrect. We then wrap them with the `Context` class.

In [2]:
raw_input = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean in the south.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."), # correct
    ("Shakespeare, a renowned English playwright and poet, wrote 39 plays during his lifetime. His works include famous plays like 'Hamlet' and 'Romeo and Juliet'.",
     "How many plays did Shakespeare write?",
     "Shakespeare wrote 31 plays."), # incorrect
    ("The human brain is an intricate organ responsible for intelligence, memory, and emotions. It is made up of approximately 86 billion neurons.",
     "What is the human brain responsible for?",
     "The human brain is responsible for physical movement."), # incorrect
]

data = [
    Context(context=c[0], question=c[1], answer=c[2])
    for c in raw_input
]

### Set up the config: JSON format

In this example, we use [`OpenAIModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) as the default LLM to rate the answers. If you want to use open-source models, you can replace the `OpenAIConfig` and `OpenAIModelConfig` with `HuggingfaceConfig` and [`HuggingfaceModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L27).

We use the default `guided_prompt` in `RaterClassificationConfig`, which contains two examples labeled `Yes` and `No`. The default examples are also wrapped by the `Context` class, with fields for context, question, and answer (and label), consistent with the input data.

The response format is JSON, so the model returns a JSON object instead of plain text, which is more convenient to process.
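
To illustrate why a JSON response is convenient, the sketch below parses a response string with Python's standard `json` module and reads the label directly by key. The string `raw_response` is a made-up example for illustration, not real model output, and this is not necessarily how `uniflow` handles responses internally.

```python
import json

# Hypothetical raw model response (illustrative only, not real output).
raw_response = (
    '{"question": "What is the largest ocean on Earth?", '
    '"answer": "The largest ocean on Earth is the Pacific Ocean.", '
    '"label": "Yes"}'
)

parsed = json.loads(raw_response)  # JSON string -> Python dict
label = parsed["label"]            # fields are accessed by key, no text matching needed
print(label)
```

With plain-text output, the same label would have to be located by pattern matching on the response string.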

In [3]:
config = RaterClassificationConfig(
    flow_name="RaterFlow",
    model_config=OpenAIModelConfig(num_call=3, response_format={"type": "json_object"}),
    label2score={"Yes": 1.0, "No": 0.0})

with OpScope(name="JSONFlow"):
    client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-3.5-turbo-1106', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'json_object'}}, label2score={'Yes': 1.0, 'No': 0.0}, guided_prompt_template=GuidedPrompt(instruction='Rate the answer based on the question and the context.\n        Follow the format of the examples below to include context, question, answer, and label in the response.\n        The response should not include examples in the prompt.', examples=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.', question='When was the Eiffel Tower constructed?', answer='The Eiffel Tower was constructed in 1889.', explanation='The context explicitly mentions that the Eiffel Tower was constructed in 1889, so the answer is correct.', label='Yes'), Context(context='Photosynthesis is a process used b

### Run the client

We can now run the client. For each item in `raw_input`, the client generates an explanation and a final label, `Yes` or `No`. The label is decided by a majority vote over three samples of the LLM output (`num_call=3`), which improves stability and self-consistency compared with a single sample.

In [4]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:05<00:00,  1.81s/it]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'yes',
              'response': [{'answer': 'The largest ocean on Earth is the '
                                      'Pacific Ocean.',
                            'context': 'The Pacific Ocean is the largest and '
                                       "deepest of Earth's oceanic divisions. "
                                       'It extends from the Arctic Ocean in '
                                       'the north to the Southern Ocean in the '
                                       'south.',
                            'label': 'Yes',
                            'question': 'What is the largest ocean on Earth?'},
                           {'answer': 'The largest ocean on Earth is the '
                                      'Pacific Ocean.',
                            'context': 'The Pacific Ocean is the largest and '
                                       "deepest of Earth'




We can see that the model response is a JSON object.

In [5]:
pprint.pprint(output[0]["output"][0]["response"][0])

{'answer': 'The largest ocean on Earth is the Pacific Ocean.',
 'context': "The Pacific Ocean is the largest and deepest of Earth's oceanic "
            'divisions. It extends from the Arctic Ocean in the north to the '
            'Southern Ocean in the south.',
 'label': 'Yes',
 'question': 'What is the largest ocean on Earth?'}


In [6]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has majority vote yes and average score 1.0
data 1 has majority vote no and average score 0.3333333333333333
data 2 has majority vote no and average score 0.0
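
The relationship between the sampled labels, the majority vote, and the average score can be sketched as follows. This is a minimal illustration, not the `uniflow` implementation: the function `aggregate` is hypothetical, and the label combination `["No", "Yes", "No"]` is an assumption chosen because it is consistent with an average score of about 0.33 and a majority vote of `no` under `label2score={"Yes": 1.0, "No": 0.0}`.

```python
from collections import Counter

label2score = {"Yes": 1.0, "No": 0.0}


def aggregate(labels, label2score):
    """Return (majority_vote, average_score) for a list of sampled labels."""
    scores = [label2score[label] for label in labels]    # map each label to its score
    votes = [label.lower() for label in labels]          # normalize labels for voting
    majority_vote = Counter(votes).most_common(1)[0][0]  # most frequent vote
    average_score = sum(scores) / len(scores)
    return majority_vote, average_score


# Assumed samples: two "No" and one "Yes".
print(aggregate(["No", "Yes", "No"], label2score))  # ('no', 0.3333333333333333)
```

An average score strictly between 0 and 1 therefore indicates that the samples disagreed, even when the majority vote is clear.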


### Set up the config: Text format

Following the previous setting, we change the `response_format` passed to `OpenAIModelConfig` to `{"type": "text"}`, so the model outputs plain text instead of a JSON object. In this case, AutoRater uses a regex to match the label.
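
A regex-based label match on plain text might look like the sketch below. The pattern `LABEL_PATTERN` and the helper `extract_label` are assumptions for illustration only; the actual regex used inside AutoRater may differ. The sketch assumes the response contains a line of the form `label: Yes` or `label: No`.

```python
import re

# Assumed pattern: the text "label:" followed by Yes or No, case-insensitive.
LABEL_PATTERN = re.compile(r"label:\s*(yes|no)\b", re.IGNORECASE)


def extract_label(text):
    """Return 'yes' or 'no' if a label is found in `text`, otherwise None."""
    match = LABEL_PATTERN.search(text)
    return match.group(1).lower() if match else None


# Illustrative plain-text response in the "explanation / label" form.
sample = "explanation: The context supports the answer.\nlabel: Yes"
print(extract_label(sample))
```

Returning `None` when nothing matches lets the caller treat a malformed response as an error instead of silently counting it as a vote.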

In [7]:
config = RaterClassificationConfig(
    flow_name="RaterFlow",
    model_config=OpenAIModelConfig(num_call=3, response_format={"type": "text"}),
    label2score={"Yes": 1.0, "No": 0.0})

with OpScope(name="TextFlow"):
    client = RaterClient(config)

output = client.run(data)

pprint.pprint(output)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-3.5-turbo-1106', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'text'}}, label2score={'Yes': 1.0, 'No': 0.0}, guided_prompt_template=GuidedPrompt(instruction='Rate the answer based on the question and the context.\n        Follow the format of the examples below to include context, question, answer, and label in the response.\n        The response should not include examples in the prompt.', examples=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.', question='When was the Eiffel Tower constructed?', answer='The Eiffel Tower was constructed in 1889.', explanation='The context explicitly mentions that the Eiffel Tower was constructed in 1889, so the answer is correct.', label='Yes'), Context(context='Photosynthesis is a process used by plant

100%|██████████| 3/3 [00:02<00:00,  1.36it/s]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'yes',
              'response': ['\n'
                           'explanation: The answer directly addresses the '
                           'question and provides the correct information '
                           'based on the context.\n'
                           'label: Yes',
                           'label: Yes',
                           'explanation: The context explicitly states that '
                           'the Pacific Ocean is the largest ocean on Earth, '
                           'so the answer is correct.\n'
                           'label: Yes'],
              'scores': [1.0, 1.0, 1.0],
              'votes': ['yes', 'yes', 'yes']}],
  'root': <uniflow.node.Node object at 0x7f7e95f23310>},
 {'output': [{'average_score': 0.0,
              'error': 'No errors.',
              'majority_vote': 'no',
              'response': ['explanation: The context expl




The model's responses can be distilled into majority votes, as shown below. Because LLM output is non-deterministic (each inference can yield a different output), we sample the LLM three times and aggregate the results into a majority vote and an average score, which improves stability and self-consistency compared with a single output.

In [9]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has majority vote yes and average score 1.0
data 1 has majority vote no and average score 0.0
data 2 has majority vote no and average score 0.0
