# Use `AutoRater` to Evaluate Answer Completeness and Accuracy for Given Questions

Do you need to evaluate the completeness and accuracy of an answer generated by a Large Language Model (LLM)? In this example, we demonstrate how to use AutoRater to verify the correctness of an answer, given a specific question and its context, using open-source Huggingface models.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

### Import the dependency
First, we set the system paths, then install and import the required libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

!{sys.executable} -m pip install transformers accelerate bitsandbytes scipy



In [2]:
import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import (
    RaterForClassificationHuggingfaceConfig,
    HuggingfaceModelConfig,
)
from uniflow.op.prompt import Context
from uniflow.op.op import OpScope

load_dotenv()



True

### Prepare the input data

We use three example raw inputs. Each one is a tuple consisting of context, question, and answer to be labeled. The ground truth label of the first one is 'correct', and the others are 'incorrect'. Then, we use the `Context` class to wrap them.

In [3]:
raw_input = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean in the south.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."), # correct
    ("Shakespeare, a renowned English playwright and poet, wrote 39 plays during his lifetime. His works include famous plays like 'Hamlet' and 'Romeo and Juliet'.",
     "How many plays did Shakespeare write?",
     "Shakespeare wrote 31 plays."), # incorrect
    ("The human brain is an intricate organ responsible for intelligence, memory, and emotions. It is made up of approximately 86 billion neurons.",
     "What is the human brain responsible for?",
     "The human brain is responsible for physical movement."), # incorrect
]

data = [
    Context(context=c[0], question=c[1], answer=c[2])
    for c in raw_input
]
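Before wrapping tuples like these, it can help to check that each one really has the three expected fields. The following is a minimal sketch in plain Python; `validate_raw_input` is a hypothetical helper for illustration and is not part of `uniflow`.

```python
from typing import List, Tuple

# Each raw example is assumed to be (context, question, answer), as described above.
RawExample = Tuple[str, str, str]


def validate_raw_input(examples: List[RawExample]) -> None:
    """Raise ValueError if any example is not three non-empty strings."""
    field_names = ("context", "question", "answer")
    for idx, example in enumerate(examples):
        if len(example) != 3:
            raise ValueError(f"example {idx} has {len(example)} fields, expected 3")
        for name, value in zip(field_names, example):
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"example {idx} has an empty or non-string {name}")


sample_input = [
    (
        "The Pacific Ocean is the largest and deepest of Earth's oceanic divisions.",
        "What is the largest ocean on Earth?",
        "The largest ocean on Earth is the Pacific Ocean.",
    ),
]
validate_raw_input(sample_input)  # passes silently when every tuple is well formed
```

Running a check like this first means a malformed tuple fails fast with a clear message, rather than surfacing later as a confusing rating.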

## Example 1: Rate answers using a Huggingface model

In this example, we use the Mistral-7B-Instruct model, which is the default LLM for this config. To use a different open-source model, replace it with another Huggingface model.

We use the default `prompt_template` in `RaterForClassificationHuggingfaceConfig`. The config includes four attributes:
- `flow_name` (str): Name of the rating flow, default is "RaterFlow".
- `model_config` (ModelConfig): Configuration for the Huggingface model. Includes model_name ("mistralai/Mistral-7B-Instruct-v0.1"), model_server ("HuggingfaceModelServer"), batch_size (1), neuron (False), load_in_4bit (False), and load_in_8bit (True).
- `label2score` (Dict[str, float]): Mapping of labels to scores, default is {"Yes": 1.0, "No": 0.0}.
- `prompt_template` (PromptTemplate): Template for the prompts used in rating. Includes instructions for rating, along with few-shot examples that detail the context, question, answer, label, and explanation for each case.
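To make the role of `label2score` concrete, here is a small sketch of how a list of predicted labels could be reduced to a single vote and a mean score. This is an assumption about how `majority_vote` and `average_score` relate to the labels, written in plain Python; it is not `uniflow`'s actual implementation, and `aggregate_labels` is a hypothetical name.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Tuple

# Default mapping described above.
LABEL2SCORE: Dict[str, float] = {"Yes": 1.0, "No": 0.0}


def aggregate_labels(labels: List[str], label2score: Dict[str, float]) -> Tuple[str, float]:
    """Return (most common label in lowercase, mean of the mapped scores)."""
    scores = [label2score[label] for label in labels]
    majority_label = Counter(labels).most_common(1)[0][0]
    return majority_label.lower(), mean(scores)


# Three hypothetical samples for one answer: two "Yes" and one "No".
vote, score = aggregate_labels(["Yes", "Yes", "No"], LABEL2SCORE)
print(vote, round(score, 2))
```

With a single sample per input, as in this notebook, the score is simply the mapped value of the one label.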

In [4]:
config = RaterForClassificationHuggingfaceConfig(
    model_config=HuggingfaceModelConfig(batch_size=1)
)
client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'mistralai/Mistral-7B-Instruct-v0.1', 'model_server': 'HuggingfaceModelServer', 'batch_size': 1, 'neuron': False, 'load_in_4bit': False, 'load_in_8bit': True}, label2score={'Yes': 1.0, 'No': 0.0}, prompt_template=PromptTemplate(instruction="Evaluate if a given answer is appropriate based on the question and the context.\n            Follow the format of the examples below, consisting of context, question, answer, explanation and label (you must choose one from ['Yes', 'No']).", few_shot_prompt=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.', question='When was the Eiffel Tower constructed?', answer='The Eiffel Tower was constructed in 1889.', explanation='The answer is consistency to the fact that Eiffel Tower was constructed in 1889 mentioned in context, so the answer is correct.', label='Yes'),

Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.13s/it]


### Run the client

Then we can run the client. For each item in `data`, the client generates an explanation and a final label, either `Yes` or `No`.

In [5]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:11<00:00,  3.83s/it]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'yes',
              'response': ['instruction: Evaluate if a given answer is '
                           'appropriate based on the question and the '
                           'context.\n'
                           '            Follow the format of the examples '
                           'below, consisting of context, question, answer, '
                           'explanation and label (you must choose one from '
                           "['Yes', 'No']).\n"
                           'context: The Eiffel Tower, located in Paris, '
                           'France, is one of the most famous landmarks in the '
                           'world. It was constructed in 1889 and stands at a '
                           'height of 324 meters.\n'
                           'question: When was the Eiffel Tower constructed?\n'
                           'answer: The Eiffel Tower was




In [6]:
pprint.pprint(output[0]["output"][0]["response"][0])

('instruction: Evaluate if a given answer is appropriate based on the question '
 'and the context.\n'
 '            Follow the format of the examples below, consisting of context, '
 "question, answer, explanation and label (you must choose one from ['Yes', "
 "'No']).\n"
 'context: The Eiffel Tower, located in Paris, France, is one of the most '
 'famous landmarks in the world. It was constructed in 1889 and stands at a '
 'height of 324 meters.\n'
 'question: When was the Eiffel Tower constructed?\n'
 'answer: The Eiffel Tower was constructed in 1889.\n'
 'explanation: The answer is consistency to the fact that Eiffel Tower was '
 'constructed in 1889 mentioned in context, so the answer is correct.\n'
 'label: Yes\n'
 'context: Photosynthesis is a process used by plants to convert light energy '
 'into chemical energy. This process primarily occurs in the chloroplasts of '
 'plant cells.\n'
 'question: Where does photosynthesis primarily occur in plant cells?\n'
 'answer: Photosynth
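Because the response string echoes the few-shot examples before the model's own verdict, one way to recover just the final explanation and label is to take the last `explanation:` and `label:` lines. The sketch below assumes the response follows the `context / question / answer / explanation / label` layout printed above; `extract_last_verdict` is a hypothetical helper, and the final explanation in the sample string is invented for illustration.

```python
import re
from typing import Optional, Tuple


def extract_last_verdict(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Return the last (explanation, label) pair found in the response text."""
    explanations = re.findall(r"^explanation:\s*(.+)$", response, flags=re.MULTILINE)
    labels = re.findall(r"^label:\s*(\w+)", response, flags=re.MULTILINE)
    explanation = explanations[-1].strip() if explanations else None
    label = labels[-1] if labels else None
    return explanation, label


sample_response = (
    "question: When was the Eiffel Tower constructed?\n"
    "answer: The Eiffel Tower was constructed in 1889.\n"
    "explanation: The answer matches the construction year given in the context.\n"
    "label: Yes\n"
    "question: How many plays did Shakespeare write?\n"
    "answer: Shakespeare wrote 31 plays.\n"
    # Hypothetical model output, not copied from a real run:
    "explanation: The context says 39 plays, so the answer is incorrect.\n"
    "label: No\n"
)

explanation, label = extract_last_verdict(sample_response)
print(label)
print(explanation)
```

Taking the last match is a heuristic: if the model keeps generating extra examples after its verdict, the last pair may not be the one you want.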

In [7]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has label \033[31m{majority_vote}\033[0m and score \033[34m{average_score}\033[0m")

data 0 has label yes and score 1.0
data 1 has label no and score 0.0
data 2 has label no and score 0.0
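Since the ground-truth labels of the three inputs are known (correct, incorrect, incorrect), the predictions can be compared against them. The sketch below uses a mock `output` list that mirrors only the `majority_vote` and `average_score` fields shown above, so it runs without loading a model; `accuracy` and `expected_labels` are names introduced here for illustration.

```python
from typing import Any, Dict, List


def accuracy(output: List[Dict[str, Any]], expected: List[str]) -> float:
    """Fraction of items whose majority_vote matches the expected label."""
    if len(output) != len(expected):
        raise ValueError("output and expected must have the same length")
    predicted = [item["output"][0]["majority_vote"].lower() for item in output]
    matches = sum(p == e.lower() for p, e in zip(predicted, expected))
    return matches / len(expected)


# Mock of the structure returned by the run above (values taken from the printed results).
mock_output = [
    {"output": [{"majority_vote": "yes", "average_score": 1.0}]},
    {"output": [{"majority_vote": "no", "average_score": 0.0}]},
    {"output": [{"majority_vote": "no", "average_score": 0.0}]},
]
expected_labels = ["Yes", "No", "No"]  # correct, incorrect, incorrect

print(accuracy(mock_output, expected_labels))  # prints 1.0
```

In the notebook itself, the real `output` from `client.run(data)` can be passed in place of `mock_output`.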
