# Use `AutoRater` to Compare Answers to Given Questions

Do you need to evaluate the completeness and accuracy of an answer generated by a Large Language Model (LLM) against a pre-formulated answer? In this example, we demonstrate how to use AutoRater to verify the correctness of a generated answer compared to the grounding answer, in relation to a given question and context.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
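
Before calling the model, it can help to confirm that the key is actually visible to the Python process (after `load_dotenv()` has run). The following is a minimal sketch, assuming the variable name `OPENAI_API_KEY` described above; the helper `has_api_key` is hypothetical and not part of Uniflow.

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if `name` is set to a non-empty value in `env`."""
    # Default to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; check the .env file in the repository root.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.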

### Import the dependency
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import (
    RaterForGeneratedAnswerOpenAIGPT4Config,
    RaterForGeneratedAnswerOpenAIGPT3p5Config
)
from uniflow.op.prompt import Context
from uniflow.op.op import OpScope


load_dotenv()


True

### Prepare the input data

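We use three sample raw inputs. Each one is a tuple consisting of a context, a question, a grounding (ground-truth) answer, and a generated answer to be labeled. The trailing comment on each tuple notes how the generated answer compares with the grounding answer. We then wrap each tuple in the `Context` class.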

In [2]:
raw_input = [
    ("Reddit is an American social news aggregation, content rating, and discussion website. Registered users submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members.",
     "What type of content can users submit on Reddit?",
     "Users can only post text on Reddit.",
     "Users on Reddit can submit various types of content including links, text posts, images, and videos."), # Better
    ("League of Legends (LoL), commonly referred to as League, is a 2009 multiplayer online battle arena video game developed and published by Riot Games. ",
     "When was League of Legends released?",
     "League of Legends was released in 2009.",
     "League of Legends was released in the early 2000s."), # Worse
    ("Vitamin C (also known as ascorbic acid and ascorbate) is a water-soluble vitamin found in citrus and other fruits, berries and vegetables, also sold as a dietary supplement and as a topical serum ingredient to treat melasma (dark pigment spots) and wrinkles on the face.",
     "Is Vitamin C water-soluble?",
     "Yes, Vitamin C is a very water-soluble vitamin.",
     "Yes, Vitamin C can be dissolved in water well."), # Equally good
]

data = [
    Context(context=c[0], question=c[1], grounding_answer=c[2], generated_answer=c[3])
    for c in raw_input
]
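
Indexing the tuples by position (`c[0]`, `c[1]`, and so on) makes it easy to swap the grounding and generated answers by mistake. As a sketch, each tuple can first be mapped onto explicit field names. The helper `to_kwargs` is hypothetical; the field names are the `Context` keyword arguments used in the cell above.

```python
FIELDS = ("context", "question", "grounding_answer", "generated_answer")


def to_kwargs(row):
    """Map one 4-tuple onto the Context keyword names, checking its length."""
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    return dict(zip(FIELDS, row))


example = (
    "League of Legends (LoL), commonly referred to as League, is a 2009 "
    "multiplayer online battle arena video game developed and published by Riot Games.",
    "When was League of Legends released?",
    "League of Legends was released in 2009.",
    "League of Legends was released in the early 2000s.",
)

kwargs = to_kwargs(example)
print(kwargs["grounding_answer"])
```

With Uniflow imported, `Context(**to_kwargs(row))` should then build the same objects as the list comprehension above, since it passes the same four keyword arguments.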

## Example 1: Output JSON format using GPT4

In this example, we use OpenAI's GPT-4 as the LLM. If you prefer open-source models, you can substitute the Hugging Face models supported by Uniflow.

We use `RaterForGeneratedAnswerOpenAIGPT4Config` with its default `prompt_template`. The config has four attributes:
- `flow_name` (str): Name of the rating flow, default is "RaterFlow".
- `model_config` (ModelConfig): Configuration for the GPT-4 model. Includes model name ("gpt-4"), the server ("OpenAIModelServer"), number of calls (1), temperature (0), and the response format (plain text).
- `label2score` (Dict[str, float]): Mapping of labels to scores, default is {"accept": 1.0, "equivalent": 0.0, "reject": -1.0}.
- `prompt_template` (PromptTemplate): Template for guided prompts used in rating. Includes instructions for rating, along with examples that detail the context, question, grounding answer, generated answer, label, and explanation for each case.


In [3]:
config = RaterForGeneratedAnswerOpenAIGPT4Config()
pprint.pprint(config)

The label2score label ['equivalent', 'reject'] not in example label.
RaterForGeneratedAnswerOpenAIGPT4Config(flow_name='RaterFlow',
                                        model_config=OpenAIModelConfig(model_name='gpt-4',
                                                                       model_server='OpenAIModelServer',
                                                                       num_call=1,
                                                                       temperature=0,
                                                                       response_format={'type': 'text'}),
                                        label2score={'accept': 1.0,
                                                     'equivalent': 0.0,
                                                     'reject': -1.0},
                                        prompt_template=PromptTemplate(instruction="\n            Compare two answers: a generated answer and a grounding answer based on a provided contex

If we want the response format to be JSON, we need to update two settings in the default config:
1. Change `model_name` to "gpt-4-1106-preview", which was the only GPT-4 model that supported JSON output when this notebook was written.
2. Change `response_format` to `{"type": "json_object"}`.

In [4]:
config.model_config.model_name = "gpt-4-1106-preview"
config.model_config.response_format = {"type": "json_object"}
config.model_config.num_call = 1
config.model_config.temperature = 0.0
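
With `response_format` set to `json_object`, the model reply is a JSON string that can be parsed directly. The sketch below is not Uniflow's internal parsing code; it only illustrates the idea on a hand-written reply containing the `explanation` and `label` fields that the rater prompt asks for. The helper `parse_reply` and the sample reply are hypothetical.

```python
import json

LABEL2SCORE = {"accept": 1.0, "equivalent": 0.0, "reject": -1.0}


def parse_reply(raw, label2score=LABEL2SCORE):
    """Parse a JSON reply and return (label, score, explanation)."""
    reply = json.loads(raw)
    # Normalize the label so "Accept" and " accept " are treated alike.
    label = str(reply["label"]).strip().lower()
    if label not in label2score:
        raise ValueError(f"unexpected label: {label!r}")
    return label, label2score[label], reply.get("explanation", "")


raw_reply = '{"explanation": "The generated answer matches the context.", "label": "accept"}'
label, score, explanation = parse_reply(raw_reply)
print(label, score)
```

Rejecting labels that are missing from `label2score` keeps a malformed reply from silently turning into a score.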

Now we can initialize a client. Since we will demonstrate multiple raters in the notebook, we will initialize them under different operation name scopes.

NOTE: The message `"The label2score label ['reject', 'equivalent'] not in example label."` is printed because the default `prompt_template` contains only one example (label=`accept`), which reduces token consumption when using GPT-4.

In [5]:
with OpScope(name="JSONFlow"):
    client = RaterClient(config)

The label2score label ['equivalent', 'reject'] not in example label.
RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-4-1106-preview', 'model_server': 'OpenAIModelServer', 'num_call': 1, 'temperature': 0.0, 'response_format': {'type': 'json_object'}}, label2score={'accept': 1.0, 'equivalent': 0.0, 'reject': -1.0}, prompt_template=PromptTemplate(instruction="\n            Compare two answers: a generated answer and a grounding answer based on a provided context and question.\n            There are few annotated examples below, consisting of context, question, grounding answer, generated answer, explanation and label.\n            If generated answer is better, you should give a label representing higher score and vise versa. Check label to score dictionary: [('accept', 1.0), ('equivalent', 0.0), ('reject', -1.0)].\n            Your response should only focus on the unlabeled sample, including two fields: explanation and label (one of ['accept', 'equivalent', 'reject']

### Run the client

Now we can run the client. For each item in the input, the client generates an explanation and a final label, which is one of `accept`, `equivalent`, or `reject`.

In [6]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:26<00:00,  8.73s/it]

[{'output': [{'error': 'No errors.',
              'response': [{'average_score': 1.0,
                            'majority_vote': 'accept',
                            'samples': [{'explanation': 'The grounding answer '
                                                        'is incorrect as it '
                                                        'states that users can '
                                                        'only post text on '
                                                        'Reddit, which '
                                                        'contradicts the '
                                                        'context provided that '
                                                        'clearly states users '
                                                        'can submit links, '
                                                        'text posts, images, '
                                                        'and videos. The '
  




We can see that each sample in the model response is a JSON object with `explanation` and `label` fields.

In [7]:
pprint.pprint(output[0]["output"][0]["response"][0])

{'average_score': 1.0,
 'majority_vote': 'accept',
 'samples': [{'explanation': 'The grounding answer is incorrect as it states '
                             'that users can only post text on Reddit, which '
                             'contradicts the context provided that clearly '
                             'states users can submit links, text posts, '
                             'images, and videos. The generated answer '
                             'accurately reflects the context by listing all '
                             'the types of content that can be submitted on '
                             'Reddit. Therefore, the generated answer is '
                             'better.',
              'label': 'accept'}],
 'scores': [1.0],
 'votes': ['accept']}


We sample the LLM only once per item, so the majority vote is simply that single label.

In [8]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['response'][0]['majority_vote']
    average_score = o['output'][0]['response'][0]['average_score']
    print(f"data {idx} has label \033[31m{majority_vote}\033[0m and score \033[34m{average_score}\033[0m")

data 0 has label accept and score 1.0
data 1 has label reject and score -1.0
data 2 has label equivalent and score 0.0


## Example 2: Output text format using GPT3.5

Unlike the previous example, here we keep the default `response_format={"type": "text"}`, so the model outputs plain text instead of a JSON object. In this case, AutoRater uses a regex to match the label. We also change `num_call` to 3 and raise `temperature` to 0.9, so the model runs inference on each example three times with some variation, and we can take the majority vote of the ratings.

In [9]:
config2 = RaterForGeneratedAnswerOpenAIGPT3p5Config()
config2.model_config.num_call = 3
config2.model_config.temperature = 0.9

with OpScope(name="TextFlow"):
    client2 = RaterClient(config2)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-3.5-turbo-1106', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'text'}}, label2score={'accept': 1.0, 'equivalent': 0.0, 'reject': -1.0}, prompt_template=PromptTemplate(instruction="\n            # Task: Evaluate and compare two answers: a generated answer and a grounding answer based on a provided context and question.\n            ## Input: A sample to be labeled:\n            1. context: A brief text containing key information.\n            2. question: A query related to the context, testing knowledge that can be inferred or directly obtained from it.\n            3. grounding Answer: Pre-formulated, usually from human.\n            4. generated Answer: From a language model.\n            ## Evaluation Criteria: If generated answer is better, you should give a label representing higher score and vise versa. Check label to score dictionary: [('accept', 1.0), ('equ
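
As noted above, in text mode the label has to be pulled out of free-form output with a regex. The pattern below is an assumption for illustration, not the regex Uniflow actually uses; it looks for a `label:` field followed by one of the known labels. `extract_label` is a hypothetical helper, and the sample string is hand-written.

```python
import re

LABELS = ("accept", "equivalent", "reject")
# Match "label: <name>" case-insensitively, tolerating extra spaces.
LABEL_RE = re.compile(r"label\s*:\s*(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def extract_label(text):
    """Return the lowercase label found in `text`, or None if absent."""
    match = LABEL_RE.search(text)
    return match.group(1).lower() if match else None


sample = (
    "explanation: The generated answer is consistent with the context.\n"
    "label: accept"
)
print(extract_label(sample))
```

Returning `None` when nothing matches lets the caller decide how to treat a reply that does not follow the expected format.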

### Run the client

Then we can run the client. For each item in `data`, the label is determined by taking the majority vote over three samples of the LLM's output.

In [10]:
output = client2.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:03<00:00,  1.28s/it]

[{'output': [{'error': 'No errors.',
              'response': [{'average_score': 1.0,
                            'majority_vote': 'accept',
                            'samples': ['explanation: The generated answer is '
                                        'better because it accurately '
                                        'identifies the various types of '
                                        'content that users can submit on '
                                        'Reddit, which is consistent with the '
                                        'information provided in the context.\n'
                                        'label: accept',
                                        'explanation: The generated answer is '
                                        'better because it accurately '
                                        'identifies the various types of '
                                        'content that users can submit on '
                                  




Because LLM output is non-deterministic (each inference can yield a different result), sampling the model several times and aggregating the results with a majority vote and an average score gives more stable, self-consistent ratings than a single call. For example, the average score of about 0.67 for data 2 below is consistent with two `accept` votes and one `equivalent` vote.

In [11]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['response'][0]['majority_vote']
    average_score = o['output'][0]['response'][0]['average_score']
    print(f"data {idx} has major vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has major vote accept and average score 1.0
data 1 has major vote reject and average score -1.0
data 2 has major vote accept and average score 0.6666666666666666
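
To make the aggregation concrete, here is a minimal sketch of how a majority vote and an average score could be computed from a list of votes with the default `label2score` mapping. This is an illustration under stated assumptions, not Uniflow's implementation; `aggregate` is a hypothetical helper and the votes are hand-written.

```python
from collections import Counter

LABEL2SCORE = {"accept": 1.0, "equivalent": 0.0, "reject": -1.0}


def aggregate(votes, label2score=LABEL2SCORE):
    """Return (majority_vote, average_score) for a list of label votes."""
    # most_common(1) yields the label with the highest count.
    majority_vote = Counter(votes).most_common(1)[0][0]
    average_score = sum(label2score[v] for v in votes) / len(votes)
    return majority_vote, average_score


# Hypothetical votes: the only mix of three votes that averages to 2/3.
majority, average = aggregate(["accept", "equivalent", "accept"])
print(majority, average)
```

In this sketch, ties are broken by whichever label appears first in the list; how Uniflow handles ties is not shown in this notebook.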


## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>
