# Step 4 - Evaluate Model Quality

## Introduction

Quality evaluation is the second critical dimension in our model migration framework, alongside latency and cost. Assessing response quality ensures that your migrated solution maintains or improves upon the capabilities of your source model.

In this notebook, we'll implement automated quality evaluation using **[Amazon Bedrock's LLM-as-a-Judge](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html)** feature, which leverages foundation models to objectively assess model outputs across multiple dimensions.

### Understanding LLM-as-a-Judge Evaluation

LLM-as-a-Judge works by:

1. Taking your model's responses along with optional reference answers (ground truth)
2. Using a specialized evaluator model to assess the quality of those responses
3. Providing standardized scores across dimensions like correctness, completeness, and style

This approach offers several advantages over traditional human evaluation:
- **Consistency**: Applies the same standards across all evaluated responses
- **Scalability**: Can evaluate thousands of responses quickly
- **Objectivity**: Reduces human bias in the evaluation process
- **Reproducibility**: Produces largely consistent results for the same inputs and evaluator model

### Our Evaluation Approach

For this workshop, we will:
1. Format the responses generated in Step 3 for evaluation
2. Create separate evaluation jobs for each model (source and candidates)
3. Upload our datasets to S3 for processing
4. Configure and start the evaluation jobs
5. Store job references for analysis in the next step

Let's begin by importing the libraries we need and initializing the AWS clients:


In [1]:
import boto3
import json
import random
from datetime import datetime
from typing import List, Dict, Any, Optional
import pandas as pd
import glob
import os
from IPython.display import display

# Initialize AWS clients
bedrock_client = boto3.client('bedrock')
s3_client = boto3.client('s3')


## Prerequisites

For this notebook to run successfully, you'll need:

- **IAM Role with Proper Permissions**: For creating evaluation jobs and accessing S3
  - _Note: In the hosted workshop, these roles are pre-configured for you_
  - _For self-paced learners: Follow the workshop instructions to create the required role_
- **S3 Bucket**: For storing evaluation datasets and results
  - _Note: The workshop creates this automatically with the correct permissions_
  - _For self-paced learners: Create your own bucket and update the bucket name in this notebook_
- **Evaluation Models**: Access to models like Claude Haiku/Sonnet and Amazon Nova

In [2]:
# AWS Configuration

account_id = boto3.client('sts').get_caller_identity().get('Account')

ROLE_ARN = f"arn:aws:iam::{account_id}:role/service-role/Bedrock-LLM-as-a-Judge-ExecutionRole"

BUCKET_NAME = f"genai-evaluation-migration-bucket-{account_id}"
PREFIX = "genai_migration"


In [3]:

# Load our evaluation tracking DataFrame

evaluation_tracking_file = '../data/evaluation_tracking.csv'
evaluation_tracking = pd.read_csv(evaluation_tracking_file)
display(evaluation_tracking)

Unnamed: 0,model,model_clean_name,text_prompt,region,inference_profile,latency_evaluation_output
0,source_model,source_model,"\nFirst, please read the article below.\n{cont...",us-east-1,standard,../outputs/
1,amazon.nova-lite-v1:0,amazon.nova-lite-v1-0,## Instruction\nYour task is to read the given...,us-east-1,standard,../outputs/
2,us.anthropic.claude-3-5-haiku-20241022-v1:0,us.anthropic.claude-3-5-haiku-20241022-v1-0,<task>\nYour task is to provide an extremely c...,us-east-1,standard,../outputs/


## Data Preparation for Quality Evaluation

### Formatting Responses for LLM-as-a-Judge

To evaluate our model outputs using Bedrock's LLM-as-a-Judge, we need to format our data in a specific JSONL structure. Each record in this format contains:

| Field | Description | Required? | Our Usage |
|-------|-------------|-----------|-----------|
| `prompt` | The original input prompt | Required | The full summarization prompt with document |
| `referenceResponse` | Ground truth/reference answer | Optional | The human-written summary from our dataset |
| `category` | Domain category for specialized evaluation | Optional | "Summarization" for our use case |
| `modelResponses` | Array of model responses to evaluate | Required | Our model's generated summary |

Within `modelResponses`, we include:

- `response`: The actual text output from our model
- `modelIdentifier`: A unique identifier for the model (used for tracking)

This format allows the judge model to compare each response against both the original prompt and the reference answer, providing comprehensive quality assessment.

> **Note**: While you can evaluate multiple responses per prompt in other Bedrock evaluation methods, LLM-as-a-Judge supports only one response per prompt in each evaluation job. That's why we're creating separate jobs for each model.

For more details on the format requirements, see the [Bedrock documentation on evaluation datasets](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-prompt-datasets-judge.html).
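
As a minimal sketch of the record structure described in the table above, the following builds a single record, checks its required fields, and serializes it to one JSONL line. The sample values and the `validate_record` helper are hypothetical and exist only for illustration; they are not part of the Bedrock API.

```python
import json

# Fields marked "Required" in the table above
REQUIRED_FIELDS = ("prompt", "modelResponses")


def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not match the structure described above."""
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            raise ValueError(f"Missing required field: {field}")
    responses = record["modelResponses"]
    # LLM-as-a-Judge expects one response per prompt in each job
    if len(responses) != 1:
        raise ValueError("Expected exactly one model response per prompt")
    for key in ("response", "modelIdentifier"):
        if key not in responses[0]:
            raise ValueError(f"Missing key in modelResponses entry: {key}")


# Hypothetical sample values, for illustration only
sample_record = {
    "prompt": "Summarize the following article: ...",
    "category": "Summarization",
    "referenceResponse": "A short human-written summary.",
    "modelResponses": [
        {"response": "A short model-written summary.", "modelIdentifier": "example-model"}
    ],
}

validate_record(sample_record)
jsonl_line = json.dumps(sample_record)  # one record per line in the JSONL file
print(jsonl_line)
```

Running a check like this before uploading can surface malformed rows earlier than a failed evaluation job would.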

Below, we'll define a function to convert our CSV results from Step 3 into this specialized JSONL format:


In [4]:

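def prepare_inference_response(df, output_file, model_clean_name, sample_size=10):
    """
    Convert a DataFrame of model responses to the JSONL structure required by LLM-as-a-Judge.

    Expects the columns 'prompt', 'referenceResponse', and 'model_response'.
    """

    # Evaluate a random sample to keep the job short; cap at the number of available rows
    random_sample = df.sample(n=min(sample_size, len(df)))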

    # Initialize a list to store all JSONL entries
    jsonl_entries = []
    
    # Process each row individually without grouping
    for idx, row in random_sample.iterrows():
        prompt = row['prompt']
        
            
        # Create the entry for this row
        entry = {
            "prompt": prompt,  # Store the full prompt as is
            "category": "Summarization",
            "referenceResponse":row['referenceResponse'],
            "modelResponses": [{
                "response": row['model_response'],
                "modelIdentifier": model_clean_name
            }]
        }
            
        jsonl_entries.append(entry)
    
    # Write to JSONL file
    with open(output_file, 'w') as f:
        for entry in jsonl_entries:
            f.write(json.dumps(entry) + '\n')
    
    print(f"Conversion complete. JSONL file saved to {output_file}")
    print(f"Total entries: {len(jsonl_entries)}")




## S3 Integration for Evaluation Datasets

### Managing Evaluation Data in the Cloud

Our evaluation jobs run as asynchronous processes in Amazon Bedrock, which requires our datasets to be accessible via S3.

### Upload Process

We'll implement a function to upload our formatted JSONL files to S3 with proper error handling. This function will:

- Take a local file path, target bucket, and desired S3 key
- Handle the upload process with proper error checking
- Provide confirmation when the upload succeeds

> **Note for Self-Paced Users**: If you're running this workshop in your own environment, ensure your IAM role or user has appropriate permissions (s3:PutObject) for your target bucket. You can verify permissions by checking the IAM policy attached to your role or user.

In [5]:
def upload_to_s3(local_file, bucket, s3_key):
    """
    Upload a file to S3 with error handling.
    """
    try:
        s3_client.upload_file(local_file, bucket, s3_key)
        print(f"✓ Successfully uploaded to s3://{bucket}/{s3_key}")
        return True
    except Exception as e:
        print(f"✗ Error uploading to S3: {str(e)}")
        return False



## Preparing and Uploading Evaluation Datasets

### Converting Model Responses to Evaluation Format

Now we'll convert our model response data into properly formatted evaluation datasets and upload them to S3. For each model:

1. We'll locate the appropriate CSV files containing model responses
2. Format them according to LLM-as-a-Judge requirements
3. Upload the formatted data to our S3 bucket
4. Track the S3 locations in our evaluation tracking dataframe

This process creates separate evaluation datasets for each candidate model, allowing us to run parallel quality evaluations while maintaining clear organization of our test data.


In [6]:
## tracking
evaluation_tracking['quality_evaluation_input'] = ""

for index, evaluation in evaluation_tracking.iterrows():
    model_id = evaluation['model']
    model_clean_name = evaluation['model_clean_name']
    latency_evaluation_output = evaluation['latency_evaluation_output']

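    outputs = glob.glob(os.path.join(latency_evaluation_output, f"document_summarization_{model_id}*.csv"))
    # Fail early with a clear message; pd.concat raises on an empty list
    if not outputs:
        raise FileNotFoundError(f"No Step 3 output files found for {model_id} in {latency_evaluation_output}")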

    quality_evaluation_file = f"../outputs/quality_evaluation.{model_id}.jsonl"
    
    ## if experiment_counts >1, there will be more than 1 output
    document_summarization_df_list = []
        
    for output in outputs:
        df = pd.read_csv(output)
        document_summarization_df_list.append(df)
        
    document_summarization_df = pd.concat(document_summarization_df_list, axis=0, ignore_index=True) 

    ## generate a jsonl file for LLM-AS-A-JUDGE evaluation
    prepare_inference_response(document_summarization_df, quality_evaluation_file, model_clean_name)

    ## upload to s3
    s3_key = f"{PREFIX}/quality_evaluation.{model_clean_name}.jsonl"
    upload_success = upload_to_s3(quality_evaluation_file, BUCKET_NAME, s3_key)

    if not upload_success:
        raise Exception("Failed to upload dataset to S3")
   
    ## record the s3 path, we will need them to create LLM-AS-A-JUDGE evaluation jobs 
    evaluation_tracking.loc[evaluation_tracking['model'] == model_id, 'quality_evaluation_input'] = f"s3://{BUCKET_NAME}/{s3_key}"
    




Conversion complete. JSONL file saved to ../outputs/quality_evaluation.source_model.jsonl
Total entries: 10
✓ Successfully uploaded to s3://genai-evaluation-migration-bucket-339712833052/genai_migration/quality_evaluation.source_model.jsonl
Conversion complete. JSONL file saved to ../outputs/quality_evaluation.amazon.nova-lite-v1:0.jsonl
Total entries: 10
✓ Successfully uploaded to s3://genai-evaluation-migration-bucket-339712833052/genai_migration/quality_evaluation.amazon.nova-lite-v1-0.jsonl
Conversion complete. JSONL file saved to ../outputs/quality_evaluation.us.anthropic.claude-3-5-haiku-20241022-v1:0.jsonl
Total entries: 10
✓ Successfully uploaded to s3://genai-evaluation-migration-bucket-339712833052/genai_migration/quality_evaluation.us.anthropic.claude-3-5-haiku-20241022-v1-0.jsonl


## Evaluation Job Configuration

### Configuring Comprehensive Quality Metrics

Now we'll configure the LLM-as-a-Judge evaluation with focused metrics for assessing model performance. The Bedrock evaluation service supports numerous dimensions for quality assessment:

| Metric Category | Description |
|----------------|-------------|
| Quality | Correctness, Completeness, Faithfulness |
| User Experience | Helpfulness, Coherence, Relevance |
| Instructions | Following Instructions, Professional Style |
| Safety | Harmfulness, Stereotyping, Refusal |

For this workshop, we're focusing on core quality metrics to balance evaluation depth with execution time. In a production evaluation, you might expand to include more dimensions based on your specific use case requirements.

> **Workshop Note**: LLM-as-a-Judge evaluation takes time to complete. For this workshop, we're measuring a subset of metrics: `Builtin.Correctness`, `Builtin.Completeness`, and `Builtin.ProfessionalStyleAndTone`. In production scenarios, you might include more metrics or create your own metrics for comprehensive evaluation.
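
As a rough sketch of how a metric selection could be checked before a job is submitted, the snippet below validates requested names against the built-in metric identifiers that appear in this notebook's job configuration. The `select_metrics` helper is hypothetical, and the set of names is taken from this notebook only, so treat it as an assumption and confirm the current list in the Bedrock documentation.

```python
# Built-in metric identifiers as they appear in this notebook (assumed, may not be exhaustive)
BUILTIN_JUDGE_METRICS = {
    "Builtin.Correctness",
    "Builtin.Completeness",
    "Builtin.Faithfulness",
    "Builtin.Helpfulness",
    "Builtin.Coherence",
    "Builtin.Relevance",
    "Builtin.FollowingInstructions",
    "Builtin.ProfessionalStyleAndTone",
    "Builtin.Harmfulness",
    "Builtin.Stereotyping",
    "Builtin.Refusal",
}


def select_metrics(requested):
    """Return the requested metrics in order, without duplicates, rejecting unknown names."""
    unknown = [name for name in requested if name not in BUILTIN_JUDGE_METRICS]
    if unknown:
        raise ValueError(f"Unknown metric names: {unknown}")
    # dict.fromkeys preserves insertion order while dropping duplicates
    return list(dict.fromkeys(requested))


# The subset measured in this workshop
workshop_metrics = select_metrics([
    "Builtin.Correctness",
    "Builtin.Completeness",
    "Builtin.ProfessionalStyleAndTone",
])
print(workshop_metrics)
```

Catching a misspelled metric name locally is cheaper than waiting for the service to reject the job request.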


In [7]:
def create_llm_judge_evaluation(
    client,
    job_name: str,
    role_arn: str,
    input_s3_uri: str,
    output_s3_uri: str,
    model_clean_name: str,
    evaluator_model_id: str,
    dataset_name: Optional[str] = None,
    task_type: str = "General"  # must be "General" for LLM-as-a-Judge
):
    # Metrics used in this workshop; uncomment others to widen the evaluation
    llm_judge_metrics = [
        "Builtin.Correctness",
        "Builtin.Completeness",
        # "Builtin.Faithfulness",
        # "Builtin.Helpfulness",
        # "Builtin.Coherence",
        # "Builtin.Relevance",
        # "Builtin.FollowingInstructions",
        "Builtin.ProfessionalStyleAndTone",
        # "Builtin.Harmfulness",
        # "Builtin.Stereotyping",
        # "Builtin.Refusal"
    ]

    # Configure dataset
    dataset_config = {
        "name": dataset_name or "CustomDataset",
        "datasetLocation": {
            "s3Uri": input_s3_uri
        }
    }

    try:
        response = client.create_evaluation_job(
            jobName=job_name,
            roleArn=role_arn,
            applicationType="ModelEvaluation",
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [
                        {
                            "taskType": task_type,
                            "dataset": dataset_config,
                            "metricNames": llm_judge_metrics
                        }
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [
                            {
                                "modelIdentifier": evaluator_model_id
                            }
                        ]
                    }
                }
            },
            inferenceConfig={
                "models": [
                    {
                        'precomputedInferenceSource': {
                            'inferenceSourceIdentifier': model_clean_name
                        }
                    }
                ]
            },
            outputDataConfig={
                "s3Uri": output_s3_uri
            }
        )
        return response
        
    except Exception as e:
        print(f"Error creating evaluation job: {str(e)}")
        raise

## Creating Evaluation Jobs

### Launching Quality Assessment for Each Model

With our evaluation datasets prepared and uploaded to S3, we're now ready to create separate evaluation jobs for each model.

Each evaluation job will:
- Use an advanced foundation model as the evaluator (Amazon Nova Pro in this case)
- Apply the same evaluation criteria across all models
- Generate standardized outputs for consistent comparison
- Store results in organized S3 locations for later analysis

The execution of these jobs may take several minutes as the evaluator model carefully assesses each response across multiple dimensions.


In [8]:
output_path = f"s3://{BUCKET_NAME}/{PREFIX}"


evaluation_tracking['quality_evaluation_jobArn'] = ""
evaluation_tracking['quality_evaluation_output'] = ""

for index, evaluation in evaluation_tracking.iterrows():
    model_id = evaluation['model']
    model_clean_name = evaluation['model_clean_name']
    quality_evaluation_input = evaluation['quality_evaluation_input']

    evaluator_model = "amazon.nova-pro-v1:0" #https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html

    if model_id == "source_model":
        model_id = model_id.replace("_", "-")
    
    job_name = f"llmaaj-{model_id.split('.')[0]}-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    

    # Create evaluation job
    try:
        llm_as_judge_response = create_llm_judge_evaluation(
            client=bedrock_client,
            job_name=job_name,
            role_arn=ROLE_ARN,
            input_s3_uri=quality_evaluation_input,
            output_s3_uri=output_path,
            model_clean_name = model_clean_name,
            evaluator_model_id=evaluator_model,
            task_type="General"
        )
        print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")

        ## record the job ARN and S3 output path; we will need them to retrieve the LLM-as-a-Judge results
        evaluation_tracking.loc[evaluation_tracking['model_clean_name'] == model_clean_name, 'quality_evaluation_jobArn'] = llm_as_judge_response['jobArn']
        evaluation_tracking.loc[evaluation_tracking['model_clean_name'] == model_clean_name, 'quality_evaluation_output'] = f"{output_path}/{job_name}/"

    
    except Exception as e:
        print(f"✗ Failed to create evaluation job: {str(e)}")
        raise
        

✓ Created evaluation job: arn:aws:bedrock:us-east-1:339712833052:evaluation-job/bcly6857m113
✓ Created evaluation job: arn:aws:bedrock:us-east-1:339712833052:evaluation-job/3zkkftou112q
✓ Created evaluation job: arn:aws:bedrock:us-east-1:339712833052:evaluation-job/z33p2of2ey1o
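
Because the jobs run asynchronously, it can be useful to check their status from code as well as from the console. The following is a minimal sketch, assuming the `get_evaluation_job` call on the boto3 `bedrock` client and the terminal status values `Completed`, `Failed`, and `Stopped`; the `wait_for_evaluation_job` helper and its polling defaults are hypothetical.

```python
import time

# Assumed terminal states for an evaluation job; verify against the Bedrock API reference
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(client, job_arn, poll_seconds=60, max_polls=60):
    """Poll an evaluation job until it reaches a terminal status or the poll budget runs out."""
    for _ in range(max_polls):
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_arn} did not finish after {max_polls} polls")
```

For example, `wait_for_evaluation_job(bedrock_client, llm_as_judge_response['jobArn'])` would block until that job finishes. Passing the client in as a parameter keeps the helper easy to exercise with a stub object.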


## Tracking Evaluation Progress

### Saving Job References for Analysis

Now that we've initiated all our evaluation jobs, we'll save our tracking information to ensure we can locate and analyze the results in the next notebook. This tracking includes:

- Job ARNs for monitoring status and retrieving results
- S3 output locations where results will be stored
- Metadata about each evaluation job for reference

This information forms the bridge between our evaluation execution and the final analysis step, ensuring we can efficiently access all the data needed for our comprehensive model comparison.


In [9]:
## saving the progress

evaluation_tracking.to_csv(evaluation_tracking_file, index=False)

## Summary and Next Steps

### What We've Accomplished

In this notebook, we've successfully:

1. ✅ **Prepared** evaluation datasets in the correct format for quality assessment
2. ✅ **Uploaded** these datasets to S3 for processing by Bedrock evaluation jobs
3. ✅ **Configured** focused quality metrics covering correctness, completeness, and professional style and tone
4. ✅ **Created** separate evaluation jobs for each model (source and candidates)
5. ✅ **Tracked** job references and output locations for subsequent analysis


### Next Steps


In the next notebook, **Step 5 - Compare Models**, we'll consolidate all our evaluation data across the three critical dimensions: quality, latency, and cost. We'll analyze the results from our evaluation jobs, create visualizations to highlight the trade-offs between different models, and produce a comprehensive migration recommendation based on our defined success criteria.

The final analysis will provide a data-driven basis for selecting the optimal model for your production deployment, balancing performance, efficiency, and cost effectiveness.

> <span style="color:red"> **Note**: You can navigate to the [Bedrock console](https://us-east-1.console.aws.amazon.com/bedrock/home?region=us-east-1#/eval/evaluation) page to view the progress of the LLM-as-a-Judge evaluation jobs.
While our evaluation jobs continue to run, we can proceed to the next notebook to set up our analysis framework. The final results will be incorporated once the jobs complete.</span>
