# Use `AutoRater` to Evaluate Answer Completeness and Accuracy for Given Questions using Huggingface Open Source Models

Do you need to evaluate the completeness and accuracy of an answer generated by a Large Language Model (LLM)? In this example, we demonstrate how to use AutoRater for verifying the correctness of an answer to a specific question and its context, using open-source Huggingface models.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

### Import the dependency
First, we set system paths, install and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

!{sys.executable} -m pip install -q transformers accelerate bitsandbytes scipy

In [2]:
import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.config import (
    RaterForClassificationHuggingfaceConfig,
    HuggingfaceModelConfig,
)
from uniflow.op.prompt import Context
from uniflow.op.op import OpScope

load_dotenv()


True

### Prepare the input data

We use three example raw inputs. Each one is a tuple consisting of context, question, and answer to be labeled. The ground truth label of the first one is 'correct', and the others are 'incorrect'. Then, we use the `Context` class to wrap them.

In [3]:
raw_input = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean in the south.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."), # correct
    ("Shakespeare, a renowned English playwright and poet, wrote 39 plays during his lifetime. His works include famous plays like 'Hamlet' and 'Romeo and Juliet'.",
     "How many plays did Shakespeare write?",
     "Shakespeare wrote 31 plays."), # incorrect
    ("The human brain is an intricate organ responsible for intelligence, memory, and emotions. It is made up of approximately 86 billion neurons.",
     "What is the human brain responsible for?",
     "The human brain is responsible for physical movement."), # incorrect
]

data = [
    Context(context=c[0], question=c[1], answer=c[2])
    for c in raw_input
]

## Example 1: Output JSON format using Mistral-7B-Instruct-v0.2

In this example, we use Mistral-7B-Instruct-v0.2, which is the default LLM. If you want to use a different open-source model, you can replace it with another Huggingface model.

We use `RaterForClassificationHuggingfaceConfig` with its default guided prompt. The config includes the following four attributes:
- `flow_name` (str): Name of the rating flow, default is "RaterFlow".
- `model_config` (ModelConfig): Configuration for the Huggingface model. Includes `model_name` ("mistralai/Mistral-7B-Instruct-v0.2"), `model_server` ("HuggingfaceModelServer"), `batch_size` (1), `neuron` (False), `load_in_4bit` (False), `load_in_8bit` (True), `response_start_key` ("explanation"), and `response_format` ({"type": "json_object"}).
- `label2score` (Dict[str, float]): Mapping of labels to scores, default is {"Yes": 1.0, "No": 0.0}.
- `prompt_template` (GuidedPrompt): Template for guided prompts used in rating. Includes instructions for rating, along with examples that detail the context, question, answer, label, and explanation for each case.

NOTE: In `model_config`, `response_format` determines whether the model generates plain text or a JSON object, and `response_start_key` is the field you want the model to generate first. Because the default prompt uses chain-of-thought (CoT) reasoning, the first generated field is `explanation` (see the default `few_shot_prompt`).
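To make the JSON case concrete, here is a minimal sketch of reading a rating from a JSON-object response in which `explanation` comes before `label`. The `raw_response` string and the `parse_json_rating` helper are hypothetical and written only for illustration; they are not part of uniflow, and the library's own parsing may differ.

```python
import json

# Hypothetical raw model output (an assumption for illustration): a JSON object
# whose first key is the response_start_key ("explanation"), followed by the label.
raw_response = (
    '{"explanation": "The answer is consistent with the context, '
    'so the answer is correct.", "label": "Yes"}'
)


def parse_json_rating(text: str) -> dict:
    """Parse a JSON-object response and return its explanation and label."""
    parsed = json.loads(text)
    return {"explanation": parsed["explanation"], "label": parsed["label"]}


rating = parse_json_rating(raw_response)
first_key = next(iter(json.loads(raw_response)))  # key order is preserved

print(first_key)
print(rating["label"])
```

Generating the explanation before the label means the label is produced after the reasoning text, which is the point of a chain-of-thought prompt.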

Now we can initialize a client. Since we will demonstrate multiple raters in the notebook, we will initialize them under different operation name scopes.

In [4]:
config = RaterForClassificationHuggingfaceConfig(
    model_config=HuggingfaceModelConfig(
        response_start_key="explanation", 
        response_format={"type": "json_object"},
        batch_size=1
    )
)
with OpScope(name="JSONFlow"):
    client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'mistralai/Mistral-7B-Instruct-v0.2', 'model_server': 'HuggingfaceModelServer', 'batch_size': 1, 'neuron': False, 'load_in_4bit': False, 'load_in_8bit': True, 'max_new_tokens': 768, 'do_sample': False, 'temperature': 0.0, 'num_beams': 1, 'num_return_sequences': 1, 'repetition_penalty': 1.2, 'response_start_key': 'explanation', 'response_format': {'type': 'json_object'}}, label2score={'Yes': 1.0, 'No': 0.0}, prompt_template=PromptTemplate(instruction="Evaluate if a given answer is appropriate based on the question and the context.\n            Follow the format of the examples below, consisting of context, question, answer, explanation and label (you must choose one from ['Yes', 'No']).", few_shot_prompt=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.', question='When was the Eiffel Tower construct

Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.29it/s]


### Run the client

Then we can run the client. For each item in `raw_input`, the client will generate an explanation and a final label, either `Yes` or `No`.

In [5]:
output = client.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:09<00:00,  3.07s/it]

[{'output': [{'error': 'No errors.',
              'response': [{'average_score': 1.0,
                            'majority_vote': 'yes',
                            'samples': [{'answer': 'The largest ocean on Earth '
                                                   'is the Pacific Ocean.',
                                         'context': 'The Pacific Ocean is the '
                                                    'largest and deepest of '
                                                    "Earth's oceanic "
                                                    'divisions. It extends '
                                                    'from the Arctic Ocean in '
                                                    'the north to the Southern '
                                                    'Ocean in the south.',
                                         'explanation': 'The answer is '
                                                        'consistent with the '
          




In [6]:
pprint.pprint(output[0]["output"][0]["response"][0])

{'average_score': 1.0,
 'majority_vote': 'yes',
 'samples': [{'answer': 'The largest ocean on Earth is the Pacific Ocean.',
              'context': 'The Pacific Ocean is the largest and deepest of '
                         "Earth's oceanic divisions. It extends from the "
                         'Arctic Ocean in the north to the Southern Ocean in '
                         'the south.',
              'explanation': 'The answer is consistent with the fact stated in '
                             'the context that the Pacific Ocean is the '
                             'largest ocean on Earth, so the answer is '
                             'correct.',
              'label': 'Yes',
              'question': 'What is the largest ocean on Earth?'}],
 'scores': [1.0],
 'votes': ['yes']}
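
The `scores`, `votes`, `average_score`, and `majority_vote` fields above are consistent with a simple aggregation over the sampled labels using `label2score`. The sketch below reproduces that shape in plain Python. The `aggregate` helper is hypothetical, and the lowercasing of votes is an assumption inferred from the output above (label `'Yes'`, vote `'yes'`); it is not taken from uniflow's source.

```python
from collections import Counter

# Default mapping described earlier in this notebook.
label2score = {"Yes": 1.0, "No": 0.0}


def aggregate(labels, label2score):
    """Turn sampled labels into scores, votes, an average score, and a majority vote."""
    scores = [label2score[label] for label in labels]
    votes = [label.lower() for label in labels]  # assumption: votes are lowercased
    return {
        "average_score": sum(scores) / len(scores),
        "majority_vote": Counter(votes).most_common(1)[0][0],
        "scores": scores,
        "votes": votes,
    }


single = aggregate(["Yes"], label2score)
several = aggregate(["Yes", "No", "Yes"], label2score)
print(single)
print(several["majority_vote"])
```

With a single sample, as in this notebook, the average score and majority vote simply mirror that one label; the aggregation only becomes informative when several samples are drawn per input.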


In [7]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['response'][0]['majority_vote']
    average_score = o['output'][0]['response'][0]['average_score']
    print(f"data {idx} has label \033[31m{majority_vote}\033[0m and score \033[34m{average_score}\033[0m")

data 0 has label yes and score 1.0
data 1 has label no and score 0.0
data 2 has label no and score 0.0
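
Since the ground truth for the three inputs is known (the first answer is correct, the others are incorrect), the rater's votes can be checked against it. The following is a minimal sketch, assuming that 'correct' corresponds to a `yes` vote and 'incorrect' to a `no` vote. `mock_output` is a hypothetical stand-in for the result of `client.run(data)`, trimmed to the fields used here so that the snippet runs without loading a model.

```python
# Ground truth from the input section, mapped to votes (assumption:
# 'correct' -> "yes", 'incorrect' -> "no").
expected_votes = ["yes", "no", "no"]

# Hypothetical stand-in for `client.run(data)`, keeping only the nesting
# and the fields read below.
mock_output = [
    {"output": [{"response": [{"majority_vote": vote, "average_score": score}]}]}
    for vote, score in [("yes", 1.0), ("no", 0.0), ("no", 0.0)]
]

predicted = [o["output"][0]["response"][0]["majority_vote"] for o in mock_output]
matches = [p == e for p, e in zip(predicted, expected_votes)]
accuracy = sum(matches) / len(matches)

print(f"accuracy: {accuracy:.2f}")
```

To use this with real results, replace `mock_output` with the `output` variable from the cell above.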


## Example 2: Output text format using Mistral-7B-Instruct-v0.2

We keep the previous settings but set `response_format={"type": "text"}`, so the model outputs plain text instead of a JSON object. In this case, AutoRater uses a regex to match the label.

In [8]:
config2 = RaterForClassificationHuggingfaceConfig(
    model_config=HuggingfaceModelConfig(
        response_start_key="explanation", 
        response_format={"type": "text"},
        batch_size=1,
    )
)
with OpScope(name="TextFlow"):
    client2 = RaterClient(config2)

RaterConfig(flow_name='RaterFlow', model_config={'model_name': 'mistralai/Mistral-7B-Instruct-v0.2', 'model_server': 'HuggingfaceModelServer', 'batch_size': 1, 'neuron': False, 'load_in_4bit': False, 'load_in_8bit': True, 'max_new_tokens': 768, 'do_sample': False, 'temperature': 0.0, 'num_beams': 1, 'num_return_sequences': 1, 'repetition_penalty': 1.2, 'response_start_key': 'explanation', 'response_format': {'type': 'text'}}, label2score={'Yes': 1.0, 'No': 0.0}, prompt_template=PromptTemplate(instruction="Evaluate if a given answer is appropriate based on the question and the context.\n            Follow the format of the examples below, consisting of context, question, answer, explanation and label (you must choose one from ['Yes', 'No']).", few_shot_prompt=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.', question='When was the Eiffel Tower constructed?', a

Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.29it/s]
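
As a rough illustration of matching a label in plain-text output with a regex, consider the sketch below. The pattern and the `extract_label` helper are assumptions made for this example and are not the expression uniflow actually uses. The helper takes the last match because the generated text may also contain the labels of the few-shot examples from the prompt.

```python
import re

# Hypothetical pattern (an assumption): a "label:" field followed by yes or no.
LABEL_PATTERN = re.compile(r"label\s*:\s*(yes|no)\b", re.IGNORECASE)


def extract_label(text):
    """Return the last yes/no label found in the text, lowercased, or None."""
    found = LABEL_PATTERN.findall(text)
    return found[-1].lower() if found else None


# Hypothetical plain-text response: a few-shot example followed by the new rating.
sample_text = (
    "context: The Eiffel Tower was constructed in 1889.\n"
    "explanation: The answer matches the context.\n"
    "label: Yes\n"
    "context: Shakespeare wrote 39 plays.\n"
    "explanation: The answer contradicts the context.\n"
    "label: No\n"
)

print(extract_label(sample_text))
print(extract_label("The model produced no rating."))
```

Regex matching is more forgiving than JSON parsing when the model's formatting drifts, but it gives up the structured `explanation` field that the JSON format provides.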


### Run the client

Then we can run the client. For each item in `raw_input`, the client will generate an explanation and a final label, either `Yes` or `No`.

In [9]:
output = client2.run(data)
pprint.pprint(output)

100%|██████████| 3/3 [00:08<00:00,  2.96s/it]

[{'output': [{'error': 'No errors.',
              'response': [{'average_score': 1.0,
                            'majority_vote': 'yes',
                            'samples': ['instruction: Evaluate if a given '
                                        'answer is appropriate based on the '
                                        'question and the context.\n'
                                        '            Follow the format of the '
                                        'examples below, consisting of '
                                        'context, question, answer, '
                                        'explanation and label (you must '
                                        "choose one from ['Yes', 'No']).\n"
                                        'context: The Eiffel Tower, located in '
                                        'Paris, France, is one of the most '
                                        'famous landmarks in the world. It was '
                        




In [10]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['response'][0]['majority_vote']
    average_score = o['output'][0]['response'][0]['average_score']
    print(f"data {idx} has label \033[31m{majority_vote}\033[0m and score \033[34m{average_score}\033[0m")

data 0 has label yes and score 1.0
data 1 has label no and score 0.0
data 2 has label no and score 0.0


## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>
