# Use `AutoRater` to Compare Answers to Given Questions

In this example, we show how to use `AutoRater` to compare a generated answer against a grounding answer for a given question.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
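
A missing key otherwise only surfaces when the first request is sent, so it can help to check for it up front. The helper below is a minimal sketch using only the standard library; `require_api_key` is a hypothetical name and not part of `uniflow`.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the value of `name`, or raise if it is missing or blank.

    Hypothetical helper for this notebook; not part of uniflow.
    `env` defaults to the process environment and can be replaced by a
    plain dict for testing.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return value
```

Calling `require_api_key()` right after `load_dotenv()` in the import cell would stop the notebook early with a readable message if the `.env` file was not picked up.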

### Import the dependency
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import RaterForGeneratedAnswerConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt_schema import Context


load_dotenv()

True

### Prepare the input data

We use three sample raw inputs. Each one is a tuple containing the context, the question, the ground-truth answer, and the generated answer to be labeled. We then wrap each tuple in the `Context` class.

In [2]:
raw_input = [
    ("Reddit is an American social news aggregation, content rating, and discussion website. Registered users submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members.",
     "What type of content can users submit on Reddit?",
     "Users can post comments on Reddit.",
     "Users on Reddit can submit various types of content including links, text posts, images, and videos."), # Better
    ("League of Legends (LoL), commonly referred to as League, is a 2009 multiplayer online battle arena video game developed and published by Riot Games. ",
     "When was League of Legends released?",
     "League of Legends was released in 2009.",
     "League of Legends was released in the early 2000s."), # Worse
    ("Vitamin C (also known as ascorbic acid and ascorbate) is a water-soluble vitamin found in citrus and other fruits, berries and vegetables, also sold as a dietary supplement and as a topical serum ingredient to treat melasma (dark pigment spots) and wrinkles on the face.",
     "Is Vitamin C water-soluble?",
     "Yes, Vitamin C is a very water-soluble vitamin.",
     "Yes, Vitamin C can be dissolved in water well."), # Equally good
]

data = [
    Context(context=c[0], question=c[1], grounding_answer=c[2], generated_answer=c[3])
    for c in raw_input
]
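
Because the tuples are positional, a swapped or missing field would silently end up in the wrong `Context` argument. The sketch below shows one way to check the rows first; `FIELDS` and `to_records` are hypothetical names introduced here for illustration, and the field names are taken from the `Context(...)` call above.

```python
# Field order used by the raw tuples in this notebook.
FIELDS = ("context", "question", "grounding_answer", "generated_answer")


def to_records(rows):
    """Turn positional tuples into dicts keyed by field name.

    Hypothetical helper, not part of uniflow. Strips surrounding
    whitespace and raises ValueError on rows with the wrong number of
    fields or with empty fields.
    """
    records = []
    for i, row in enumerate(rows):
        if len(row) != len(FIELDS):
            raise ValueError(
                f"row {i}: expected {len(FIELDS)} fields, got {len(row)}"
            )
        record = dict(zip(FIELDS, (field.strip() for field in row)))
        empty = [name for name, value in record.items() if not value]
        if empty:
            raise ValueError(f"row {i}: empty fields {empty}")
        records.append(record)
    return records
```

Under these assumptions, the wrapping step could be written as `data = [Context(**r) for r in to_records(raw_input)]`.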

### Set up the config

In this example, we use [`OpenAIModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) as the default LLM to rate the generated answers. If you want to use open-source models, you can replace `OpenAIConfig` and `OpenAIModelConfig` with `HuggingfaceConfig` and [`HuggingfaceModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L27).

We use the default `guided_prompt` in `RaterForGeneratedAnswerConfig`, which contains five examples (one per class), labeled `Strong accept`, `Accept`, `Equivalent`, `Reject`, and `Strong reject`. The default examples are also wrapped in the `Context` class, with the fields context, question, grounding answer, generated answer, and label, so they are consistent with the input data.


In [3]:
config = RaterForGeneratedAnswerConfig(
    flow_name="RaterFlow",
    model_config=OpenAIModelConfig(model_name="gpt-4-1106-preview", num_call=3, response_format={"type": "json_object"}),
    label2score={
        "strong accept": 2.0,
        "accept": 1.0,
        "equivalent": 0.0,
        "reject": -1.0,
        "strong reject": -2.0,
    }
)

client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-4-1106-preview', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'json_object'}}, label2score={'strong accept': 2.0, 'accept': 1.0, 'equivalent': 0.0, 'reject': -1.0, 'strong reject': -2.0}, guided_prompt_template=GuidedPrompt(instruction='\n        # Task: Evaluate and compare two answers: a "Generated Answer" and a "Grounding Answer" based on a provided context and question.\n        ## Input: A sample to be labeled:\n        1. context: A brief text containing key information.\n        2. question: A query related to the context, testing knowledge that can be inferred or directly obtained from it.\n        3. grounding Answer: Pre-formulated, usually from human.\n        4. generated Answer: From a language model.\n        ## Evaluation Criteria: Decide which answer is better. Use labels:\n        1. strong accept: Generated better than Grounding\n        2. accept

### Run the client

Now we can run the client. For each item in the input data, the client generates an explanation and a final label, one of [`Strong accept`, `Accept`, `Equivalent`, `Reject`, `Strong reject`]. The label is determined by a majority vote over three samples of the LLM output, which improves stability compared to generating a single output.

In [4]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:22<00:00,  7.58s/it]

[{'output': [{'average_score': 2.0,
              'error': 'No errors.',
              'majority_vote': 'strong accept',
              'response': [{'explanation': 'The generated answer provides a '
                                           'comprehensive list of the types of '
                                           'content users can submit on '
                                           'Reddit, which includes links, text '
                                           'posts, images, and videos. This '
                                           'directly corresponds to the '
                                           'context given. The grounding '
                                           'answer is too narrow, mentioning '
                                           'only comments, which may be a '
                                           'feature of Reddit but does not '
                                           'encompass the range of content '
                             




The model's responses can be distilled into a majority vote and an average score per item, as shown below. Because LLM output is non-deterministic (each inference can yield a different result), we sample the LLM three times per item and aggregate the samples, which gives more stable and self-consistent labels than a single output.

In [5]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has majority vote strong accept and average score 2.0
data 1 has majority vote strong reject and average score -1.6666666666666667
data 2 has majority vote equivalent and average score 0.0
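
The aggregation described above can be sketched in plain Python. This is an illustration of majority voting and score averaging under stated assumptions, not `uniflow`'s actual implementation: `aggregate` is a hypothetical helper, and the three individual labels for item 1 are assumed (the full output is truncated above), chosen because they are consistent with the reported average score.

```python
from collections import Counter

# Same mapping as the `label2score` argument passed to the config above.
LABEL2SCORE = {
    "strong accept": 2.0,
    "accept": 1.0,
    "equivalent": 0.0,
    "reject": -1.0,
    "strong reject": -2.0,
}


def aggregate(labels, label2score=LABEL2SCORE):
    """Return (majority_vote, average_score) for a list of sampled labels.

    Hypothetical helper, not part of uniflow. Labels are lowercased
    before lookup; on a tie, the label that was seen first wins.
    """
    normalized = [label.strip().lower() for label in labels]
    majority_vote, _ = Counter(normalized).most_common(1)[0]
    average_score = sum(label2score[l] for l in normalized) / len(normalized)
    return majority_vote, average_score


# Assumed samples for item 1: two "strong reject" and one "reject".
print(aggregate(["strong reject", "strong reject", "reject"]))
# → ('strong reject', -1.6666666666666667)
```

The two numbers answer different questions: the majority vote gives a single categorical label, while the average score also reflects how much the individual samples disagreed.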
