# Use `AutoRater` to Evaluate Answer Completeness and Accuracy for Given Questions

Do you need to evaluate the completeness and accuracy of an answer generated by a Large Language Model (LLM)? In this example, we demonstrate how to use AutoRater to verify the correctness of an answer, given a specific question and its context.

### Before running the code

You will need to create a `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
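A missing key usually surfaces only when the first request fails, so it can help to check for it up front. The snippet below is a minimal sketch of such a check using only the standard library; `require_api_key` is a hypothetical helper for illustration and is not part of Uniflow. It assumes the variables from `.env` have already been loaded into the process environment (for example with `load_dotenv()`).

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper (not a Uniflow function). `env` defaults to
    os.environ and can be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return key


# Example with an explicit mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```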

### Import the dependencies
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import (
    RaterForClassificationOpenAIGPT4Config,
    RaterForClassificationOpenAIGPT3p5Config
)
from uniflow.op.prompt import Context
from uniflow.op.op import OpScope

load_dotenv()

True

### Prepare the input data

We use three example raw inputs. Each is a tuple of a context, a question, and an answer to be labeled. The ground-truth label of the first is 'correct'; the other two are 'incorrect'. We then wrap each tuple in the `Context` class.

In [2]:
raw_input = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean in the south.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."), # correct
    ("Shakespeare, a renowned English playwright and poet, wrote 39 plays during his lifetime. His works include famous plays like 'Hamlet' and 'Romeo and Juliet'.",
     "How many plays did Shakespeare write?",
     "Shakespeare wrote 31 plays."), # incorrect
    ("The human brain is an intricate organ responsible for intelligence, memory, and emotions. It is made up of approximately 86 billion neurons.",
     "What is the human brain responsible for?",
     "The human brain is responsible for physical movement."), # incorrect
]

data = [
    Context(context=c[0], question=c[1], answer=c[2])
    for c in raw_input
]

## Example 1: Output JSON format using GPT4

In this example, we use the OpenAI GPT-4 model as the default LLM. If you prefer open-source models, you can substitute the Hugging Face models available in Uniflow.

We use the default `guided_prompt` in `RaterForClassificationOpenAIGPT4Config`. The config has four attributes:
- `flow_name` (str): Name of the rating flow, default is "RaterFlow".
- `model_config` (ModelConfig): Configuration for the GPT-4 model. Includes model name ("gpt-4"), the server ("OpenAIModelServer"), number of calls (1), temperature (0), and the response format (plain text).
- `label2score` (Dict[str, float]): Mapping of labels to scores, default is {"Yes": 1.0, "No": 0.0}.
- `prompt_template` (PromptTemplate): Template for guided prompts used in rating. Includes instructions for rating, along with examples that detail the context, question, answer, label, and explanation for each case.

In [3]:
config = RaterForClassificationOpenAIGPT4Config()
pprint.pprint(config)

RaterForClassificationOpenAIGPT4Config(flow_name='RaterFlow',
                                       model_config=OpenAIModelConfig(model_name='gpt-4',
                                                                      model_server='OpenAIModelServer',
                                                                      num_call=1,
                                                                      temperature=0,
                                                                      response_format={'type': 'text'}),
                                       label2score={'No': 0.0, 'Yes': 1.0},
                                       prompt_template=PromptTemplate(instruction="\n            Evaluate the appropriateness of a given answer based on the question and the context.\n            There are few examples below, consisting of context, question, answer, explanation and label.\n            If answer is appropriate, you should give a label representing higher score and vise versa. C

If we want the response format to be JSON, we need to update two settings in the default config:
1. Change `model_name` to "gpt-4-1106-preview", which at the time of writing is the only GPT-4 model that supports the JSON response format.
2. Change `response_format` to `{"type": "json_object"}`.

In [4]:
config.model_config.model_name = "gpt-4-1106-preview"
config.model_config.response_format = {"type": "json_object"}
config.model_config.num_call = 1
config.model_config.temperature = 0.0
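With `response_format` set to `json_object`, each reply can be parsed as JSON instead of being matched with a regex. The sketch below illustrates that idea under stated assumptions: the raw reply is a JSON string with `explanation` and `label` fields (the two fields the prompt instruction asks for), and `parse_json_rating` is a hypothetical helper, not Uniflow's own parsing code.

```python
import json

# Default label-to-score mapping shown in the config above.
LABEL2SCORE = {"Yes": 1.0, "No": 0.0}


def parse_json_rating(raw, label2score=LABEL2SCORE):
    """Parse one JSON reply into explanation, label, and score.

    Hypothetical helper for illustration only.
    """
    data = json.loads(raw)
    label = data["label"]
    if label not in label2score:
        raise ValueError(f"Unexpected label: {label!r}")
    return {
        "explanation": data.get("explanation", ""),
        "label": label,
        "score": label2score[label],
    }


# A hand-written reply in the assumed shape (not real model output).
raw_reply = '{"explanation": "The context says 39 plays, not 31.", "label": "No"}'
print(parse_json_rating(raw_reply))
```

Rejecting labels outside `label2score` keeps a malformed reply from being silently scored.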

Now we can initialize a client. Since we will demonstrate multiple raters in the notebook, we will initialize them under different operation name scopes.

In [5]:
with OpScope(name="JSONFlow"):
    client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-4-1106-preview', 'model_server': 'OpenAIModelServer', 'num_call': 1, 'temperature': 0.0, 'response_format': {'type': 'json_object'}}, label2score={'Yes': 1.0, 'No': 0.0}, prompt_template=PromptTemplate(instruction="\n            Evaluate the appropriateness of a given answer based on the question and the context.\n            There are few examples below, consisting of context, question, answer, explanation and label.\n            If answer is appropriate, you should give a label representing higher score and vise versa. Check label to score dictionary: [('Yes', 1.0), ('No', 0.0)].\n            Your response should only focus on the unlabeled sample, including two fields: explanation and label (one of ['Yes', 'No']).\n            ", few_shot_prompt=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 mete

### Run the client

Then we can run the client. For each item in `raw_input`, the client will generate an explanation and a final label, either `Yes` or `No`.

In [6]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:12<00:00,  4.09s/it]

[{'output': [{'error': 'No errors.',
              'response': [{'average_score': 1.0,
                            'majority_vote': 'yes',
                            'samples': [{'explanation': 'The context provided '
                                                        'states that the '
                                                        'Pacific Ocean is the '
                                                        'largest and deepest '
                                                        "of Earth's oceanic "
                                                        'divisions, which '
                                                        'directly answers the '
                                                        'question. Therefore, '
                                                        'the answer given is '
                                                        'correct.',
                                         'label': 'Yes'}],
                           




We can see that each sample in the model response is a JSON object with `explanation` and `label` fields.

In [7]:
pprint.pprint(output[0]["output"][0]["response"][0])

{'average_score': 1.0,
 'majority_vote': 'yes',
 'samples': [{'explanation': 'The context provided states that the Pacific '
                             "Ocean is the largest and deepest of Earth's "
                             'oceanic divisions, which directly answers the '
                             'question. Therefore, the answer given is '
                             'correct.',
              'label': 'Yes'}],
 'scores': [1.0],
 'votes': ['yes']}


We sample the LLM only once, so the majority vote is simply the single label for each item.

In [9]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['response'][0]['majority_vote']
    average_score = o['output'][0]['response'][0]['average_score']
    print(f"data {idx} has label \033[31m{majority_vote}\033[0m and score \033[34m{average_score}\033[0m")

data 0 has label yes and score 1.0
data 1 has label no and score 0.0
data 2 has label no and score 0.0
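Since the ground-truth labels of the three inputs are known (correct, incorrect, incorrect), the ratings can be checked against them. The following is a minimal sketch, assuming the nested output structure shown above; `extract_votes`, `accuracy`, and `mock_output` are hypothetical names introduced for illustration, and `mock_output` is a hand-built stand-in with the same shape as `output`, not real model output.

```python
# Ground truth for the three inputs: correct, incorrect, incorrect.
expected = ["yes", "no", "no"]


def extract_votes(output):
    """Collect the majority vote for each rated item."""
    return [o["output"][0]["response"][0]["majority_vote"] for o in output]


def accuracy(predicted, expected):
    """Fraction of predictions matching the expected labels (case-insensitive)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    matches = sum(p.lower() == e.lower() for p, e in zip(predicted, expected))
    return matches / len(expected)


# Stand-in data shaped like `output` from client.run(data).
mock_output = [
    {"output": [{"response": [{"majority_vote": vote}]}]}
    for vote in ["yes", "no", "no"]
]

print(accuracy(extract_votes(mock_output), expected))  # 1.0
```

In the notebook, passing the real `output` to `extract_votes` in place of `mock_output` would run the same comparison.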


## Example 2: Output text format using GPT3.5

In this example, we keep the default `response_format={"type": "text"}`, so the model outputs plain text instead of a JSON object. In this case, AutoRater uses a regex to match the label. We also change `num_call` to 3 and set `temperature` to 0.9, so the model performs inference on each example three times and we can take the majority vote of the ratings.
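To make the regex step concrete, here is a minimal sketch of label extraction from a plain-text reply. The pattern and the helper `extract_label` are assumptions for illustration; AutoRater's actual regex is not shown in this notebook and may differ. The sketch assumes replies end with a line such as `label: Yes`, as the prompt's response format requests.

```python
import re

# Assumed pattern: the word "label", a colon, then Yes or No (any case).
LABEL_PATTERN = re.compile(r"label\s*:\s*(yes|no)\b", re.IGNORECASE)


def extract_label(text):
    """Return the last 'yes'/'no' label found in the text, or None."""
    matches = LABEL_PATTERN.findall(text)
    # The label follows the explanation, so the last match is the verdict.
    return matches[-1].lower() if matches else None


# A hand-written reply in the assumed format (not real model output).
sample = "explanation: The context states 39 plays, not 31.\nlabel: No"
print(extract_label(sample))  # no
```

Returning `None` for unmatched text lets the caller decide how to treat replies that ignore the requested format.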

In [10]:
config2 = RaterForClassificationOpenAIGPT3p5Config()
config2.model_config.num_call = 3
config2.model_config.temperature = 0.9

with OpScope(name="TextFlow"):
    client2 = RaterClient(config2)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-3.5-turbo-1106', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'text'}}, label2score={'Yes': 1.0, 'No': 0.0}, prompt_template=PromptTemplate(instruction="\n            # Task: Evaluate the appropriateness of a given answer based on a provided context and question.\n            ## Input:\n            1. context: A brief text containing key information.\n            2. question: A query related to the context, testing knowledge that can be inferred or directly obtained from it.\n            3. answer: A response to the question.\n            ## Evaluation Criteria: If answer is appropriate, you should give a label representing higher score and vise versa. Check label to score dictionary: [('Yes', 1.0), ('No', 0.0)].\n            ## Response Format: Your response should only include two fields below:\n            1. explanation: Reasoning behind your judgment, explaini

### Run the client

Then we can run the client. For each item in `raw_input`, the client will generate an explanation and a final label, either `Yes` or `No`. The label is determined by taking the majority vote from three samples of the LLM's output, which improves stability and self-consistency compared to generating a single output.

In [11]:
output = client2.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:03<00:00,  1.03s/it]

[{'output': [{'error': 'No errors.',
              'response': [{'average_score': 1.0,
                            'majority_vote': 'yes',
                            'samples': ['explanation: The answer correctly '
                                        'identifies the Pacific Ocean as the '
                                        'largest ocean on Earth, which is '
                                        'explicitly stated in the context.\n'
                                        'label: Yes',
                                        'explanation: The context directly '
                                        'states that the Pacific Ocean is the '
                                        'largest ocean on Earth, so the answer '
                                        'is correct.\n'
                                        'label: Yes',
                                        'explanation: The answer correctly '
                                        'identifies the Pacific Ocean as




The model's responses can be distilled into majority votes, as shown below. Because LLM output is non-deterministic (each inference can yield a different result), we sample the LLM three times per item and aggregate the results into a majority vote and an average score, which is more stable and self-consistent than relying on a single output.
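The aggregation itself is simple to express. The sketch below shows one way to turn several sampled labels into a majority vote and an average score using the default `label2score` mapping; `aggregate` is a hypothetical helper and not necessarily how Uniflow implements it.

```python
from collections import Counter


def aggregate(labels, label2score):
    """Combine sampled labels into votes, scores, a majority vote, and a mean.

    Hypothetical helper for illustration only.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    # Normalize case so 'Yes' and 'yes' count as the same vote.
    lookup = {name.lower(): score for name, score in label2score.items()}
    votes = [label.lower() for label in labels]
    scores = [lookup[vote] for vote in votes]
    majority_vote = Counter(votes).most_common(1)[0][0]
    return {
        "votes": votes,
        "scores": scores,
        "majority_vote": majority_vote,
        "average_score": sum(scores) / len(scores),
    }


# Three sampled labels for one item, two of which agree.
result = aggregate(["Yes", "Yes", "No"], {"Yes": 1.0, "No": 0.0})
print(result["majority_vote"], round(result["average_score"], 2))  # yes 0.67
```

An odd number of samples avoids ties between the two labels, which is one reason to sample three times rather than two.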

In [12]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['response'][0]['majority_vote']
    average_score = o['output'][0]['response'][0]['average_score']
    print(f"data {idx} has major vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has major vote yes and average score 1.0
data 1 has major vote no and average score 0.0
data 2 has major vote no and average score 0.0


## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>
