# Use AutoRater to Assess Question Answer Accuracy using Bedrock from a Jupyter Notebook

In this example, we show how to use AutoRater to verify whether an answer is correct for a given question and context.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [AWS CLI profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to run the code. You can set up the profile by running `aws configure --profile <profile_name>` in your terminal. 
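
Before running the notebook, it can help to confirm that the profile you plan to pass as `aws_profile` actually exists. The following is a minimal sketch that uses only the Python standard library. The helper name `list_aws_profiles` is hypothetical, and the sketch assumes the standard AWS CLI file layout: sections named `[name]` in the credentials file and `[profile name]` (or `[default]`) in the config file.

```python
import configparser
from pathlib import Path


def list_aws_profiles(credentials_path, config_path):
    """Return the sorted profile names found in the two AWS CLI files.

    Hypothetical helper, not part of uniflow or the AWS SDK.
    """
    profiles = set()

    creds = configparser.ConfigParser()
    creds.read(credentials_path)  # a missing file is silently skipped
    profiles.update(creds.sections())

    conf = configparser.ConfigParser()
    conf.read(config_path)
    for section in conf.sections():
        if section == "default":
            profiles.add(section)
        elif section.startswith("profile "):
            # The config file prefixes named profiles with "profile ".
            profiles.add(section[len("profile "):].strip())
        # Other section kinds (for example SSO sessions) are ignored.

    return sorted(profiles)


aws_dir = Path.home() / ".aws"
print(list_aws_profiles(aws_dir / "credentials", aws_dir / "config"))
```

If the profile you expect is missing from the printed list, run `aws configure --profile <profile_name>` again.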

### Import dependencies
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import pprint

from dotenv import load_dotenv
from IPython.display import display

from uniflow.flow.client import RaterClient
from uniflow.flow.flow_factory import FlowFactory
from uniflow.flow.config import RaterClassificationConfig
from uniflow.op.model.model_config import BedrockModelConfig
from uniflow.viz import Viz
from uniflow.op.prompt_schema import Context

load_dotenv()


True

In [3]:
FlowFactory.list()

{'extract': ['ExtractIpynbFlow',
  'ExtractMarkdownFlow',
  'ExtractPDFFlow',
  'ExtractTxtFlow'],
 'transform': ['TransformAzureOpenAIFlow',
  'TransformCopyFlow',
  'TransformHuggingFaceFlow',
  'TransformLMQGFlow',
  'TransformOpenAIFlow'],
 'rater': ['RaterFlow']}

### Prepare the input data

We use three examples. Each one is a tuple of a context, a question, and an answer to be labeled. The ground-truth label of the first two is correct and that of the third is incorrect. We then wrap each tuple in the `Context` class.
   

In [4]:
raw_input = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean in the south.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."), # correct
    ("Shakespeare, a renowned English playwright and poet, wrote 39 plays during his lifetime. His works include famous plays like 'Hamlet' and 'Romeo and Juliet'.",
     "How many plays did Shakespeare write?",
     "Shakespeare wrote 39 plays."), # correct
    ("The human brain is an intricate organ responsible for intelligence, memory, and emotions. It is made up of approximately 86 billion neurons.",
     "What is the human brain responsible for?",
     "The human brain is responsible for physical movement."), # incorrect
]

In [5]:
data = [
    Context(context=c[0], question=c[1], answer=c[2])
    for c in raw_input
]
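
Because each input is a positional tuple, a field that is missing or in the wrong order only shows up later as a confusing rating. The sketch below validates the tuples before they are wrapped. `QAExample` and `to_examples` are hypothetical stand-ins written for illustration; they mirror the three fields passed to `Context` but are not part of uniflow.

```python
from typing import NamedTuple


class QAExample(NamedTuple):
    """Hypothetical stand-in mirroring the fields passed to Context."""
    context: str
    question: str
    answer: str


def to_examples(rows):
    """Validate (context, question, answer) tuples and name their fields."""
    examples = []
    for i, row in enumerate(rows):
        if len(row) != 3:
            raise ValueError(f"row {i} has {len(row)} fields, expected 3")
        if not all(isinstance(field, str) and field.strip() for field in row):
            raise ValueError(f"row {i} contains an empty or non-string field")
        examples.append(QAExample(*row))
    return examples


rows = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."),
]
for example in to_examples(rows):
    print(example.question, "->", example.answer)
```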

### Set up config: JSON format

In this example, we use `BedrockModelConfig` to configure the LLM that rates the answers. If you want to use open-source models, you can replace `BedrockModelConfig` with `HuggingfaceConfig` and [`HuggingfaceModelConfig`](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L27).

We use the default `guided_prompt` in `RaterClassificationConfig`, which contains two examples, labeled Yes and No. The default examples are also wrapped in the `Context` class with the fields context, question, answer (and label), consistent with the input data.

Note that `BedrockModelConfig` is not given a response format here, so the model returns plain text that follows the layout of the examples in the prompt rather than a JSON object.

In [6]:
config = RaterClassificationConfig(
    flow_name="RaterFlow",
    model_config=BedrockModelConfig(
        aws_profile="default",
        aws_region="us-west-2",
        model_kwargs={"temperature": 0.1},
    ),
    label2score={"Yes": 1.0, "No": 0.0},
)
client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'aws_region': 'us-west-2', 'aws_profile': 'default', 'aws_access_key_id': '', 'aws_secret_access_key': '', 'aws_session_token': '', 'model_name': 'anthropic.claude-v2', 'batch_size': 1, 'model_server': 'BedrockModelServer', 'model_kwargs': {'temperature': 0.1}}, label2score={'Yes': 1.0, 'No': 0.0}, guided_prompt_template=GuidedPrompt(instruction='Rate the answer based on the question and the context.\n        Follow the format of the examples below to include context, question, answer, and label in the response.\n        The response should not include examples in the prompt.', examples=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.', question='When was the Eiffel Tower constructed?', answer='The Eiffel Tower was constructed in 1889.', explanation='The context explicitly mentions that the Eiffel Tower was cons

### Run client

Now we can run the client. For each item in `raw_input`, the client generates an explanation and a final label, Yes or No. The label is decided by a majority vote over three samples of the LLM output, which improves stability and self-consistency compared with a single sample.
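
The aggregation described above can be sketched in a few lines. This is an illustration of majority voting combined with a `label2score` mapping, not uniflow's actual implementation; the function name `aggregate_votes` is hypothetical, and lowercasing the labels is an assumption made to match the lowercase `majority_vote` values in the output.

```python
from collections import Counter


def aggregate_votes(labels, label2score):
    """Return (majority_vote, average_score) for a list of sampled labels.

    Hypothetical helper: labels are compared case-insensitively.
    """
    if not labels:
        raise ValueError("at least one sampled label is required")

    normalized = [label.strip().lower() for label in labels]
    scores = {key.lower(): value for key, value in label2score.items()}

    # most_common(1) returns the label with the highest count.
    majority_vote, _count = Counter(normalized).most_common(1)[0]
    average_score = sum(scores[label] for label in normalized) / len(normalized)
    return majority_vote, average_score


# Three samples, two of which agree on "Yes".
print(aggregate_votes(["Yes", "Yes", "No"], {"Yes": 1.0, "No": 0.0}))
```

With two possible labels, an odd number of samples guarantees that the vote cannot tie, which is one reason to sample three times rather than two.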
   

In [7]:
output = client.run(data)
pprint.pprint(output)

  0%|          | 0/3 [00:00<?, ?it/s]

100%|██████████| 3/3 [00:12<00:00,  4.14s/it]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'yes',
              'response': [' context: The Eiffel Tower, located in Paris, '
                           'France, is one of the most famous landmarks in the '
                           'world. It was constructed in 1889 and stands at a '
                           'height of 324 meters.\n'
                           'question: When was the Eiffel Tower constructed?\n'
                           'answer: The Eiffel Tower was constructed in 1889.\n'
                           'label: Yes\n'
                           '\n'
                           'context: Photosynthesis is a process used by '
                           'plants to convert light energy into chemical '
                           'energy. This process primarily occurs in the '
                           'chloroplasts of plant cells.  \n'
                           'question: Where does photosynthesis primarily '
 




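We can see that the model response is a plain-text string in the context/question/answer/label layout of the prompt examples. It repeats the two examples from the prompt, and the final block rates our input.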

In [8]:
pprint.pprint(output[0]["output"][0]["response"][0])

(' context: The Eiffel Tower, located in Paris, France, is one of the most '
 'famous landmarks in the world. It was constructed in 1889 and stands at a '
 'height of 324 meters.\n'
 'question: When was the Eiffel Tower constructed?\n'
 'answer: The Eiffel Tower was constructed in 1889.\n'
 'label: Yes\n'
 '\n'
 'context: Photosynthesis is a process used by plants to convert light energy '
 'into chemical energy. This process primarily occurs in the chloroplasts of '
 'plant cells.  \n'
 'question: Where does photosynthesis primarily occur in plant cells?\n'
 'answer: Photosynthesis primarily occurs in the mitochondria of plant cells.\n'
 'label: No\n'
 '\n'
 "context: The Pacific Ocean is the largest and deepest of Earth's oceanic "
 'divisions. It extends from the Arctic Ocean in the north to the Southern '
 'Ocean in the south.\n'
 'question: What is the largest ocean on Earth?  \n'
 'answer: The largest ocean on Earth is the Pacific Ocean.\n'
 'label: Yes')
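
Since the response repeats the prompt examples before rating the input, the label of interest is the last `label:` line. The sketch below shows one way to pull it out with a regular expression. It assumes the plain-text layout printed above and is not the parser uniflow uses; `extract_final_label` is a hypothetical name.

```python
import re

# Matches a whole line of the form "label: <word>", ignoring case.
LABEL_PATTERN = re.compile(
    r"^[ \t]*label:[ \t]*(\w+)[ \t]*$", re.IGNORECASE | re.MULTILINE
)


def extract_final_label(response_text):
    """Return the last label found in the response, or None if there is none."""
    matches = LABEL_PATTERN.findall(response_text)
    return matches[-1] if matches else None


sample = (
    " context: The Eiffel Tower was constructed in 1889.\n"
    "question: When was the Eiffel Tower constructed?\n"
    "answer: The Eiffel Tower was constructed in 1889.\n"
    "label: Yes\n"
    "\n"
    "context: Photosynthesis primarily occurs in the chloroplasts.\n"
    "question: Where does photosynthesis primarily occur?\n"
    "answer: In the mitochondria.\n"
    "label: No"
)
print(extract_final_label(sample))
```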


In [9]:
for idx, o in enumerate(output):
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote {majority_vote} and average score {average_score}")

data 0 has majority vote yes and average score 1.0
data 1 has majority vote yes and average score 1.0
data 2 has majority vote no and average score 0.0
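
To check the rater against the ground truth noted in the comments of `raw_input` (correct, correct, incorrect), the majority votes can be compared with the expected labels. The sketch below is a hypothetical helper, `rater_accuracy`, and the two lists are copied by hand from the comments and the printed output above rather than read from `output`.

```python
def rater_accuracy(predicted_votes, expected_votes):
    """Return the fraction of predictions that match the expected labels."""
    if len(predicted_votes) != len(expected_votes):
        raise ValueError("both lists must have the same length")
    if not expected_votes:
        raise ValueError("at least one example is required")

    matches = sum(
        predicted.strip().lower() == expected.strip().lower()
        for predicted, expected in zip(predicted_votes, expected_votes)
    )
    return matches / len(expected_votes)


predicted = ["yes", "yes", "no"]  # majority votes printed above
expected = ["yes", "yes", "no"]   # correct, correct, incorrect
print(rater_accuracy(predicted, expected))  # 1.0
```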
