# Use AutoRater to Compare Answers to a Given Question from a Jupyter Notebook

In this example, we show how to use AutoRater from a Jupyter notebook to compare a generated answer against a grounding answer to a given question.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
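
As a quick sanity check, the sketch below tests whether the key is visible to the process. It is only an illustration: `has_api_key` is a hypothetical helper, not part of `uniflow`, and it should be called after `load_dotenv()` has run so that values from the `.env` file are already in the environment.

```python
import os


def has_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return True if `name` is set to a non-blank value in `env`.

    Hypothetical helper for illustration; it is not part of uniflow.
    """
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; check the .env file in the repository root.")
```

The function only checks that a value is present. It does not verify that the key is valid with OpenAI.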

### Import dependencies
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import RaterForGeneratedAnswerConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt_schema import Context


load_dotenv()


True

### Prepare the input data

We use three examples. Each one is a tuple containing a context, a question, a grounding answer, and a generated answer to be labeled. We then wrap each tuple in the `Context` class.

In [2]:
raw_input = [
    ("Reddit is an American social news aggregation, content rating, and discussion website. Registered users submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members.",
     "What type of content can users submit on Reddit?",
     "Users can post comments on Reddit.",
     "Users on Reddit can submit various types of content including links, text posts, images, and videos."), # Better
    ("League of Legends (LoL), commonly referred to as League, is a 2009 multiplayer online battle arena video game developed and published by Riot Games. ",
     "When was League of Legends released?",
     "League of Legends was released in 2009.",
     "League of Legends was released in the early 2000s."), # Worse
    ("Vitamin C (also known as ascorbic acid and ascorbate) is a water-soluble vitamin found in citrus and other fruits, berries and vegetables, also sold as a dietary supplement and as a topical serum ingredient to treat melasma (dark pigment spots) and wrinkles on the face.",
     "Is Vitamin C water-soluble?",
     "Yes, Vitamin C is a very water-soluble vitamin.",
     "Yes, Vitamin C can be dissolved in water well."), # Equally good
]
data = [
    Context(context=c[0], question=c[1], grounding_answer=c[2], generated_answer=c[3])
    for c in raw_input
]
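
Because the tuples are positional, a misplaced or missing field is easy to overlook. The sketch below is one way to check the shape of `raw_input` before wrapping it in `Context`. The helper `validate_examples` and the sample data are hypothetical and not part of `uniflow`.

```python
FIELDS = ("context", "question", "grounding_answer", "generated_answer")


def validate_examples(examples):
    """Check that every example is a tuple of four non-empty strings.

    Hypothetical helper for illustration; it is not part of uniflow.
    Returns the number of examples checked, or raises ValueError.
    """
    for idx, example in enumerate(examples):
        if len(example) != len(FIELDS):
            raise ValueError(
                f"example {idx} has {len(example)} fields, expected {len(FIELDS)}"
            )
        for name, value in zip(FIELDS, example):
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"example {idx} has an empty or non-string {name}")
    return len(examples)


# Made-up sample in the same (context, question, grounding, generated) order.
sample = [
    ("Water boils at 100 degrees Celsius at sea level.",
     "At what temperature does water boil at sea level?",
     "Water boils at 100 degrees Celsius.",
     "It boils at 100 degrees Celsius at sea level."),
]
print(validate_examples(sample))
```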

### Set up config

In this example, we will use the [`OpenAIModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17) as the default LLM to rate the generated answers. If you want to use open-source models, you can replace `OpenAIConfig` and `OpenAIModelConfig` with `HuggingfaceConfig` and [`HuggingfaceModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L27).

We use the default `guided_prompt` in `RaterForGeneratedAnswerConfig`, which contains five examples (one shot per class), labeled `Strong accept`, `Accept`, `Equivalent`, `Reject`, and `Strong reject`. The default examples are also wrapped in the `Context` class, with fields for context, question, grounding answer, and generated answer (plus the label), consistent with the input data.


In [3]:
config = RaterForGeneratedAnswerConfig(
    flow_name="RaterFlow",
    model_config=OpenAIModelConfig(model_name="gpt-4-1106-preview", num_call=3, response_format={"type": "text"}),
    label2score={
        "strong accept": 2.0,
        "accept": 1.0,
        "equivalent": 0.0,
        "reject": -1.0,
        "strong reject": -2.0,
    }
)
client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'gpt-4-1106-preview', 'model_server': 'OpenAIModelServer', 'num_call': 3, 'temperature': 0.9, 'response_format': {'type': 'json_object'}}, label2score={'strong accept': 2.0, 'accept': 1.0, 'equivalent': 0.0, 'reject': -1.0, 'strong reject': -2.0}, guided_prompt_template=GuidedPrompt(instruction='Rate the generated answer compared to the grounding answer to the question. Accept means the generated answer is better than the grounding answer and reject means worse.\n        Follow the format of the examples below to include context, question, grounding answer, generated answer and label in the response.\n        The response should not include examples in the prompt.', examples=[Context(context='Basic operating system features were developed in the 1950s, and more complex functions were introduced in the 1960s.', question='When were basic operating system features developed?', grounding_answer='In the 1960s, people developed s

### Run client

Now we can run the client. For each item in `raw_input`, the client generates an explanation and a final label, one of `Strong accept`, `Accept`, `Equivalent`, `Reject`, or `Strong reject`. The label is decided by a majority vote over three samples of the LLM output (`num_call=3`), which is more stable than relying on a single sample.

In [4]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:28<00:00,  9.35s/it]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'accept',
              'response': [{'context': 'Reddit is an American social news '
                                       'aggregation, content rating, and '
                                       'discussion website. Registered users '
                                       'submit content to the site such as '
                                       'links, text posts, images, and videos, '
                                       'which are then voted up or down by '
                                       'other members.',
                            'explanation': 'The generated answer is better as '
                                           'it provides a more comprehensive '
                                           'list of the types of content that '
                                           'users can submit on Reddit, '
                                           'closely




In [5]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has majority vote accept and average score 1.0
data 1 has majority vote reject and average score -1.0
data 2 has majority vote equivalent and average score 0.0
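
To make the two reported numbers concrete, here is a minimal sketch of how a majority vote and an average score could be derived from several sampled labels. It assumes the majority vote is the most frequent label and the average score is the mean of the scores given by `label2score`; the `aggregate` function is hypothetical, and `uniflow`'s internal implementation may differ.

```python
from collections import Counter

# Same mapping as the label2score passed to the config above.
LABEL2SCORE = {
    "strong accept": 2.0,
    "accept": 1.0,
    "equivalent": 0.0,
    "reject": -1.0,
    "strong reject": -2.0,
}


def aggregate(labels, label2score=LABEL2SCORE):
    """Return (majority_vote, average_score) for a list of sampled labels.

    Hypothetical helper for illustration; it is not part of uniflow.
    Labels are lowercased so that "Accept" and "accept" count as the same class.
    """
    normalized = [label.strip().lower() for label in labels]
    scores = [label2score[label] for label in normalized]
    # most_common(1) returns the label with the highest count;
    # ties go to the label that appeared first.
    majority_vote, _ = Counter(normalized).most_common(1)[0]
    average_score = sum(scores) / len(scores)
    return majority_vote, average_score


# Three made-up samples for one question.
print(aggregate(["accept", "accept", "accept"]))
```

With mixed samples the two values can diverge: for example, `["Accept", "strong accept", "accept"]` still has `accept` as its majority vote, but its average score is above 1.0.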
