# Use `AutoRater` to Assess Question Answer Accuracy using Bedrock

Do you need to evaluate the completeness and accuracy of an answer generated by a Large Language Model (LLM)? In this example, we will show you how to use AutoRater to verify the correctness of an answer to a given question and context pair.

### Before running the code

You will need to create a `uniflow` conda environment to run this notebook. You can set up the environment by following the instructions at https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [AWS CLI profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to run the code. You can set up the profile by running `aws configure --profile <profile_name>` in your terminal. You will need to provide your AWS Access Key ID and AWS Secret Access Key. You can find your AWS Access Key ID and AWS Secret Access Key in the [Security Credentials](https://console.aws.amazon.com/iam/home?region=us-east-1#/security_credentials) section of the AWS console.

```bash
$ aws configure --profile <profile_name>
AWS Access Key ID [None]: <your_access_key_id>
AWS Secret Access Key [None]: <your_secret_access_key>
Default region name [None]: us-west-2
Default output format [None]: json
```

Make sure to set `Default output format` to `json`.

> Note: If you don't have the AWS CLI installed, you will get a `command not found: aws` error. To install it, follow the instructions [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
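
Before moving on, it can help to confirm that the profile was actually written. The snippet below is a minimal sketch, not part of `uniflow`: `profile_exists` is a hypothetical helper that reads the INI-style credentials file with the standard library, assuming the default `~/.aws/credentials` location used on Linux and macOS.

```python
import configparser
from pathlib import Path


def profile_exists(profile_name, credentials_path=None):
    """Return True if the named profile has both keys in the credentials file.

    Hypothetical helper for illustration; it only inspects the file and never
    contacts AWS, so it cannot tell whether the keys are valid.
    """
    # Assumption: the default credentials location on Linux and macOS.
    if credentials_path is None:
        credentials_path = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # yields no sections if the file is missing
    if not parser.has_section(profile_name):
        return False
    section = parser[profile_name]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    print(profile_exists("default"))
```

A `False` result usually means the profile name is misspelled or `aws configure` was run for a different profile.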

### Import dependency
First, we set system paths and import libraries.

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import pprint

from dotenv import load_dotenv
from uniflow.flow.client import RaterClient
from uniflow.flow.config import RaterForClassificationBedrockClaudeConfig
from uniflow.op.prompt_schema import Context

load_dotenv()


True

### Install Extra Libraries

In [3]:
!{sys.executable} -m pip install -q boto3



### Prepare the input data

We use three example raw inputs. Each one is a tuple containing the context, the question, and the answer to be labeled. The ground-truth label of the first two is 'correct', and that of the last one is 'incorrect'.

In [4]:
raw_input = [
    ("The Pacific Ocean is the largest and deepest of Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean in the south.",
     "What is the largest ocean on Earth?",
     "The largest ocean on Earth is the Pacific Ocean."), # correct
    ("Shakespeare, a renowned English playwright and poet, wrote 39 plays during his lifetime. His works include famous plays like 'Hamlet' and 'Romeo and Juliet'.",
     "How many plays did Shakespeare write?",
     "Shakespeare wrote 39 plays."), # correct
    ("The human brain is an intricate organ responsible for intelligence, memory, and emotions. It is made up of approximately 86 billion neurons.",
     "What is the human brain responsible for?",
     "The human brain is responsible for physical movement."), # incorrect
]

Then, we use the `Context` class to wrap them for processing by `uniflow`.

In [5]:
data = [
    Context(context=c[0], question=c[1], answer=c[2])
    for c in raw_input
]
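
Because each raw input is a plain tuple, a malformed row only surfaces once the model is called. The check below is a small sketch of how the tuples could be validated up front; `validate_raw_input` is a hypothetical helper, not a `uniflow` function.

```python
def validate_raw_input(rows):
    """Check that every row is a (context, question, answer) triple of non-empty strings.

    Hypothetical helper for illustration. Returns the rows unchanged so it can
    be used inline, and raises ValueError on the first malformed row.
    """
    field_names = ("context", "question", "answer")
    for idx, row in enumerate(rows):
        if len(row) != len(field_names):
            raise ValueError(f"row {idx}: expected 3 fields, got {len(row)}")
        for name, value in zip(field_names, row):
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"row {idx}: {name} must be a non-empty string")
    return rows
```

It could be called as `validate_raw_input(raw_input)` just before the rows are wrapped in `Context`.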

### Set up config: JSON format

In this example, we will use `RaterForClassificationBedrockClaudeConfig` to rate answers against their questions and contexts. It uses the `Claude v2` model by default.

We use the default `prompt_template` in `RaterForClassificationBedrockClaudeConfig`, which contains two examples, labeled Yes and No. The default examples are also wrapped in the `Context` class with the fields context, question, and answer (plus label), consistent with the input data.

The response format is `json`, as specified in our `aws configure`, so the model returns a JSON object as output instead of plain text, which is more convenient to process.

`RaterForClassificationBedrockClaudeConfig` has four attributes:
- `flow_name` (str): Name of the rating flow; the default is "RaterFlow".
- `model_config` (ModelConfig): Configuration for the Bedrock model. It includes `aws_region` (""), `aws_profile` ("default"), `aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`, `model_name` ("anthropic.claude-v2"), `batch_size` (1), `model_server` ("BedrockModelServer"), and `model_kwargs`.
- `label2score` (Dict[str, float]): Mapping of labels to scores; the default is {"Yes": 1.0, "No": 0.0}.
- `prompt_template` (GuidedPrompt): Template for the guided prompt used in rating. It includes the rating instruction along with examples that give the context, question, answer, label, and explanation for each case.

The configuration primarily focuses on setting up the parameters for utilizing `Claude v2` to evaluate the correctness of answers in relation to given questions and contexts.

In [6]:
config = RaterForClassificationBedrockClaudeConfig()
pprint.pprint(config)

RaterForClassificationBedrockClaudeConfig(flow_name='RaterFlow',
                                       model_config=BedrockModelConfig(aws_region='',
                                                                       aws_profile='default',
                                                                       aws_access_key_id='',
                                                                       aws_secret_access_key='',
                                                                       aws_session_token='',
                                                                       model_name='anthropic.claude-v2',
                                                                       batch_size=1,
                                                                       model_server='BedrockModelServer',
                                                                       model_kwargs={}),
                                       label2score={'No': 0.0, 'Yes': 1.0},
           

We can customize some parameters in `RaterForClassificationBedrockClaudeConfig` to fit our needs. First, we need to set `aws_profile` to match the profile we configured earlier with the AWS CLI, and we also have to provide an `aws_region`. We can also set model parameters such as the temperature.

In [7]:
config.model_config.aws_region = "us-west-2"
config.model_config.aws_profile = "default"
config.model_config.model_kwargs = {'temperature': 0.1}

Now, we can initialize the `client`.

In [8]:

client = RaterClient(config)

RaterConfig(flow_name='RaterFlow', model_config={'aws_region': 'us-west-2', 'aws_profile': 'default', 'aws_access_key_id': '', 'aws_secret_access_key': '', 'aws_session_token': '', 'model_name': 'anthropic.claude-v2', 'batch_size': 1, 'model_server': 'BedrockModelServer', 'model_kwargs': {'temperature': 0.1}}, label2score={'Yes': 1.0, 'No': 0.0}, prompt_template=GuidedPrompt(instruction="Rate the answer based on the question and the context.\n        Follow the format of the examples below to include context, question, answer, and label in the response.\n        The response should not include examples in the prompt. The response label should be one of the following: ['Yes', 'No'].", examples=[Context(context='The Eiffel Tower, located in Paris, France, is one of the most famous landmarks in the world. It was constructed in 1889 and stands at a height of 324 meters.', question='When was the Eiffel Tower constructed?', answer='The Eiffel Tower was constructed in 1889.', explanation='The

### Run the client

Then we can run the client. For each item in `raw_input`, the client will generate an explanation and a final label of `Yes` or `No`.
   

In [None]:
output = client.run(data)
pprint.pprint(output)

  0%|          | 0/3 [00:00<?, ?it/s]

100%|██████████| 3/3 [00:12<00:00,  4.14s/it]

[{'output': [{'average_score': 1.0,
              'error': 'No errors.',
              'majority_vote': 'yes',
              'response': [' context: The Eiffel Tower, located in Paris, '
                           'France, is one of the most famous landmarks in the '
                           'world. It was constructed in 1889 and stands at a '
                           'height of 324 meters.\n'
                           'question: When was the Eiffel Tower constructed?\n'
                           'answer: The Eiffel Tower was constructed in 1889.\n'
                           'label: Yes\n'
                           '\n'
                           'context: Photosynthesis is a process used by '
                           'plants to convert light energy into chemical '
                           'energy. This process primarily occurs in the '
                           'chloroplasts of plant cells.  \n'
                           'question: Where does photosynthesis primarily '
 




In [None]:
pprint.pprint(output[0]["output"][0]["response"][0])

(' context: The Eiffel Tower, located in Paris, France, is one of the most '
 'famous landmarks in the world. It was constructed in 1889 and stands at a '
 'height of 324 meters.\n'
 'question: When was the Eiffel Tower constructed?\n'
 'answer: The Eiffel Tower was constructed in 1889.\n'
 'label: Yes\n'
 '\n'
 'context: Photosynthesis is a process used by plants to convert light energy '
 'into chemical energy. This process primarily occurs in the chloroplasts of '
 'plant cells.  \n'
 'question: Where does photosynthesis primarily occur in plant cells?\n'
 'answer: Photosynthesis primarily occurs in the mitochondria of plant cells.\n'
 'label: No\n'
 '\n'
 "context: The Pacific Ocean is the largest and deepest of Earth's oceanic "
 'divisions. It extends from the Arctic Ocean in the north to the Southern '
 'Ocean in the south.\n'
 'question: What is the largest ocean on Earth?  \n'
 'answer: The largest ocean on Earth is the Pacific Ocean.\n'
 'label: Yes')
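
The raw response above echoes the two prompt examples before rating the actual input, so the label of interest is the last `label:` line. The snippet below is a minimal sketch of pulling that label out of the text. `final_label` is a hypothetical helper, and the "last label wins" rule is an assumption based only on the response shown here, not on how `uniflow` parses responses internally.

```python
import re

# Matches lines such as "label: Yes" or "label: No", ignoring case and trailing spaces.
LABEL_PATTERN = re.compile(r"^label:[ \t]*(yes|no)[ \t]*$", re.MULTILINE | re.IGNORECASE)


def final_label(response_text):
    """Return the last label found in the response text, or None if there is none.

    Hypothetical helper: assumes the final "label:" line belongs to the rated
    input and that any earlier ones are echoed prompt examples.
    """
    matches = LABEL_PATTERN.findall(response_text)
    return matches[-1] if matches else None
```

With the response above, `final_label(output[0]["output"][0]["response"][0])` would pick up the label on the last line.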


The model's responses can be distilled into majority votes, as shown below. Because LLM output is non-deterministic (each inference can yield a different result), the rater samples three LLM outputs per item and aggregates them into a majority vote and an average score, which is more stable and self-consistent than relying on a single output.

In [None]:
for idx, o in enumerate(output):
    # majority_vote and average_score sit next to 'response' in each output dict
    majority_vote = o['output'][0]['majority_vote']
    average_score = o['output'][0]['average_score']
    print(f"data {idx} has majority vote \033[31m{majority_vote}\033[0m and average score \033[34m{average_score}\033[0m")

data 0 has majority vote yes and average score 1.0
data 1 has majority vote yes and average score 1.0
data 2 has majority vote no and average score 0.0
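
To make the aggregation concrete, here is a minimal sketch of how several sampled labels could be reduced to a majority vote and an average score with a `label2score` mapping like the default one. `aggregate` is a hypothetical function written for illustration; it is not `uniflow`'s implementation, which (judging by the output above) also lowercases the vote.

```python
from collections import Counter

# Same mapping as the default label2score shown in the config.
LABEL2SCORE = {"Yes": 1.0, "No": 0.0}


def aggregate(labels, label2score=LABEL2SCORE):
    """Reduce sampled labels to a majority vote and an average score.

    Hypothetical sketch: ties in the vote go to the label seen first.
    """
    if not labels:
        raise ValueError("at least one label is required")
    scores = [label2score[label] for label in labels]
    majority_vote, _count = Counter(labels).most_common(1)[0]
    return {
        "majority_vote": majority_vote,
        "average_score": sum(scores) / len(scores),
    }


if __name__ == "__main__":
    # Three samplings for one item, two of which agree.
    print(aggregate(["Yes", "Yes", "No"]))
```

An average score strictly between 0.0 and 1.0 signals that the samplings disagreed, which can be a useful cue for manual review.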


## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>
