# Step 5 - Compare Models

## Introduction

In this final notebook, we'll bring together all the metrics collected throughout our migration evaluation process to make data-driven decisions about model selection. Building on the latency measurements from Step 3 and quality assessments from Step 4, we'll now add cost projections and generate comprehensive visualizations to compare our candidate models.

Whether you're migrating between foundation models or selecting a model for a new application, it's essential to consider multiple dimensions simultaneously. Our analysis framework helps you visualize and quantify these considerations, providing clear guidance for your model selection decision regardless of your starting point.


### What You'll Get From This Analysis

This notebook produces two key artifacts:

1. **PDF Analysis Report**: A comprehensive document with visualizations comparing models across all dimensions
2. **CSV Summary**: Raw data for further analysis or integration with other business metrics

Let's begin consolidating our evaluation results!


## Evaluation Tools and Utilities

### Leveraging Pre-built Analysis Components

To streamline our model comparison process, we'll use two specialized Python modules created for this workshop:

1. **`pricing`**: Provides the `PriceCalculator` class, which calculates costs from token usage
2. **`generate_analysis_report`**: Provides the `Analysis_Report` class, which produces PDF reports

These utilities encapsulate complex functionality so we can focus on analysis rather than implementation details. The complete source code for these modules is available in the `../src` directory and can be customized for your own evaluation needs beyond this workshop.

> **Workshop Note**: When adapting this notebook for your own migration projects, consider extending these classes to incorporate additional metrics specific to your use case.


In [1]:
import json
import boto3
import numpy as np
from scipy import stats
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from matplotlib.backends.backend_pdf import PdfPages
import datetime
import sys

sys.path.append("../")

from src import pricing
from src import generate_analysis_report


# AWS Configuration

account_id = boto3.client('sts').get_caller_identity().get('Account')

BUCKET_NAME = f"genai-evaluation-migration-bucket-{account_id}"
PREFIX = "genai_migration"



## Loading Evaluation Data

### Retrieving Our Model Tracking Information

Before we can analyze our models, we need to access the tracking information maintained throughout our evaluation process.
By loading this tracking information first, we establish a foundation for our consolidated analysis and ensure we're working with consistent model metadata throughout our comparison process.

> **Note for Self-Paced Learners**: If you've modified any paths or filenames during previous steps, ensure those changes are reflected in your tracking file path below. The evaluation process relies on consistent tracking information across all notebooks.


In [2]:
# Load the evaluation tracking DataFrame

evaluation_tracking_file = '../data/evaluation_tracking.csv'
evaluation_tracking = pd.read_csv(evaluation_tracking_file)
display(evaluation_tracking)

Unnamed: 0,model,model_clean_name,text_prompt,region,inference_profile,latency_evaluation_output,quality_evaluation_input,quality_evaluation_jobArn,quality_evaluation_output
0,source_model,source_model,"\nFirst, please read the article below.\n{cont...",us-east-1,standard,../outputs/,s3://genai-evaluation-migration-bucket-3397128...,arn:aws:bedrock:us-east-1:339712833052:evaluat...,s3://genai-evaluation-migration-bucket-3397128...
1,amazon.nova-lite-v1:0,amazon.nova-lite-v1-0,## Instruction\nYour task is to read the given...,us-east-1,standard,../outputs/,s3://genai-evaluation-migration-bucket-3397128...,arn:aws:bedrock:us-east-1:339712833052:evaluat...,s3://genai-evaluation-migration-bucket-3397128...
2,us.anthropic.claude-3-5-haiku-20241022-v1:0,us.anthropic.claude-3-5-haiku-20241022-v1-0,<task>\nYour task is to provide an extremely c...,us-east-1,standard,../outputs/,s3://genai-evaluation-migration-bucket-3397128...,arn:aws:bedrock:us-east-1:339712833052:evaluat...,s3://genai-evaluation-migration-bucket-3397128...


## Cost Analysis Preparation

### Calculating Per-Request and Projected Costs

While we're waiting for our LLM-as-a-Judge quality evaluation jobs to complete, we can analyze the economic impact of each model option. Cost considerations are crucial for production deployments, as even small per-request differences can result in significant operational expense at scale.

Our cost analysis includes two key components:

1. **Per-Inference Cost**: Calculated from the token usage recorded during our latency evaluation:

   `cost_per_inference = (input_tokens × input_token_price) + (output_tokens × output_token_price)`

2. **Auxiliary Service Costs**: Additional expenses beyond direct model inference
   - **LLM-as-a-Judge**: Charged based on evaluator model usage
   - **Prompt Optimization**: Charged per token for input and optimized prompts

This cost information will be crucial when making the final model selection, allowing us to balance performance and quality against budgetary constraints. In production environments, this analysis can be extended to include projected monthly costs based on expected request volumes.



> **Workshop Note**: Amazon Bedrock pricing is updated periodically. For the most current pricing, refer to the [Bedrock Pricing Page](https://aws.amazon.com/bedrock/pricing/).
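
The per-inference formula above can be sketched in a few lines. This is a minimal illustration, not the workshop's `pricing` module: the function name `cost_per_inference`, the assumption that prices are quoted per 1,000 tokens, and the prices and request volume themselves are all placeholders chosen for the example.

```python
def cost_per_inference(input_tokens, output_tokens,
                       input_price_per_1k, output_price_per_1k):
    """Return the USD cost of one request, given per-1,000-token prices."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices (USD per 1,000 tokens), not real Bedrock rates.
example_cost = cost_per_inference(
    input_tokens=1200,
    output_tokens=80,
    input_price_per_1k=0.001,
    output_price_per_1k=0.004,
)
print(f"Cost per request: ${example_cost:.6f}")

# Projected monthly cost for a hypothetical volume of 1,000,000 requests.
monthly_requests = 1_000_000
print(f"Projected monthly cost: ${example_cost * monthly_requests:,.2f}")
```

Multiplying the per-request cost by an expected request volume is the simplest way to turn small per-request differences into a monthly figure that can be compared across models.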


In [3]:
directory = "../outputs"

In [4]:
calculator = pricing.PriceCalculator()

# Find all matching CSV files
all_files = glob.glob(os.path.join(directory, "document_summarization_*.csv"))

# Process each file
for filename in all_files:

    
    print(f"\nProcessing {os.path.basename(filename)}...")
    
    # Read the CSV file
    document_summarization_df = pd.read_csv(filename)


    model_id = document_summarization_df["model"][0]  ## change the name back to match the pricing config
    #print(model_id)
    
    print(f"Total rows: {len(document_summarization_df)}")
    

    # Calculate input costs for all rows at once
    input_costs = document_summarization_df["model_input_tokens"].apply(
        lambda tokens: calculator.calculate_input_price(tokens, model_id)
    )

    # Calculate output costs for all rows at once
    output_costs = document_summarization_df["model_output_tokens"].apply(
        lambda tokens: calculator.calculate_output_price(tokens, model_id)
    )
    
    # Calculate total costs
    document_summarization_df["cost"] = (input_costs + output_costs).round(6)


    # Write back to the same file
    document_summarization_df.to_csv(filename, index=False)
    #print(f"Updated and saved {filename}")



Processing document_summarization_us.anthropic.claude-3-5-haiku-20241022-v1:0_20250624_003534.csv...
Total rows: 10

Processing document_summarization_source_model.csv...
Total rows: 73

Processing document_summarization_amazon.nova-lite-v1:0_20250624_003419.csv...
Total rows: 10


## Prompt Optimization Economics

### Calculating the Cost of Automated Prompt Improvement

In addition to model inference and quality evaluation, our migration process leverages Amazon Bedrock's Prompt Optimization service to improve prompt effectiveness across models. Like other AI services, this capability has its own pricing structure that should be considered in your total migration cost.

#### Understanding Prompt Optimization Pricing

Bedrock charges for prompt optimization based on token volume:

- **Rate**: \$0.030 per 1,000 tokens
- **Counted Tokens**: Both input prompts and optimized output prompts
- **Billing Cycle**: Monthly, based on total token usage

This pricing model means that costs scale with both the size of your prompts and the number of optimizations you perform. For most migration scenarios, this represents a small one-time cost, but it's still valuable to estimate for complete budget planning.

#### Calculation Example

To illustrate how prompt optimization costs work in practice, consider this example:

An application developer optimizes a news summarization prompt originally written for Claude 3.5:
- Original prompt: 429 tokens
- Optimized prompt for Claude 3.5: 511 tokens
- This optimized prompt is then used as input to generate variants for:
  - Claude 3.7: 582 tokens
  - Nova Pro: 579 tokens

**Token calculation**:
- Input tokens: 429 + 511 + 511 = 1,451 tokens
- Output tokens: 511 + 582 + 579 = 1,672 tokens
- Total tokens: 3,123 tokens

**Cost calculation**:
3,123 tokens ÷ 1,000 × \$0.03 ≈ \$0.09
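
The same arithmetic can be reproduced in Python as a quick check. This is only a sketch of the worked example: the token counts and the rate are the ones quoted above, and the `runs` structure is a hypothetical way of organizing them.

```python
RATE_PER_1K_TOKENS = 0.03  # USD per 1,000 tokens, as quoted above

# Each optimization run is billed on its input prompt plus its optimized output.
runs = [
    {"input": 429, "output": 511},  # original prompt -> optimized for Claude 3.5
    {"input": 511, "output": 582},  # optimized prompt -> Claude 3.7 variant
    {"input": 511, "output": 579},  # optimized prompt -> Nova Pro variant
]

input_tokens = sum(run["input"] for run in runs)
output_tokens = sum(run["output"] for run in runs)
total_tokens = input_tokens + output_tokens
cost = total_tokens / 1000 * RATE_PER_1K_TOKENS

print(f"Input: {input_tokens}, output: {output_tokens}, total: {total_tokens}")
print(f"Estimated cost: ${cost:.4f}")
```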

For our workshop scenario, prompt optimization represents a minimal expense compared to ongoing inference costs, but tracking it provides a complete picture of migration economics.

> **Workshop Note**: In production systems, prompt optimization might be performed periodically as models evolve or requirements change. While the per-optimization cost is low, enterprises with many different prompts should account for this in their operational budgets.


In [5]:
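# Estimate the total tokens processed by prompt optimization.
# Each optimization run is billed on its input (the source prompt) plus its output (the optimized prompt).

input_prompt_len = 0
optimized_prompts = []
for index, evaluation in evaluation_tracking.iterrows():
    model_id = evaluation['model']
    if model_id == "source_model":
        input_prompt_len = len(evaluation['text_prompt'])
    else:
        optimized_prompts.append(evaluation['text_prompt'])

# Total characters: every optimized prompt, plus the source prompt once per optimization run
total_prompt_len = sum(len(prompt) for prompt in optimized_prompts) + input_prompt_len * len(optimized_prompts)

# Rough heuristic: about 4 characters per token, billed at $0.03 per 1,000 tokens
prompt_optimization_cost = total_prompt_len / 4 / 1000 * 0.03

print(f"Estimated cost for prompt optimization: ${prompt_optimization_cost}")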

Estimated cost for prompt optimization: $0.0134025


## Understanding LLM-as-a-Judge Costs

### Breaking Down Evaluation Economics

Beyond the direct cost of model inference, it's important to account for the expense of quality evaluation itself. LLM-as-a-Judge is a powerful evaluation method, but it also has associated costs that should be factored into your migration planning.

#### Pricing Components

Bedrock charges for LLM-as-a-Judge evaluations based on the following components:

1. **Model Inference**: The primary cost is for the evaluator model's usage
   - Automatically-generated algorithmic scores are provided without additional charges
   - For human-based evaluation with your own workstream, there's a \$0.21 charge per completed human task

2. **Token Consumption**: Each evaluation involves several components:
   - **Judge Prompts**: Each metric/evaluator uses its own [specialized prompt](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt-nova.html) (~300 tokens per metric)
   - **Input Content**: The original prompts and model responses being evaluated
   - **Output Results**: JSON output with evaluation scores (~20 tokens per metric)

#### Cost Calculation Formula

For accurate budgeting, we can estimate evaluation costs using this formula for each metric:

Evaluation Cost = [((Tokens in prompts) + (Tokens in responses) + (Judge prompt tokens))/1000 × (Evaluator input token price)] + [(Output tokens)/1000 × (Evaluator output token price)]


Where:
- Input tokens = Number of tokens in your prompts + responses + judge prompts (typically ~300 tokens)
- Output tokens = Number of tokens in the evaluator's output (typically ~20 tokens per metric)
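
As a rough illustration, the formula can be written as a small helper. This is a sketch under stated assumptions: the function name, the example token counts, and the evaluator prices are hypothetical placeholders, and the defaults of 300 judge-prompt tokens and 20 output tokens are the approximations given above.

```python
def evaluation_cost_per_metric(prompt_tokens, response_tokens,
                               input_price_per_1k, output_price_per_1k,
                               judge_prompt_tokens=300, output_tokens=20):
    """Estimate the USD cost of scoring one record on one metric."""
    input_tokens = prompt_tokens + response_tokens + judge_prompt_tokens
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical record: 1,000 prompt tokens and 100 response tokens,
# scored by an evaluator with placeholder per-1,000-token prices.
per_metric = evaluation_cost_per_metric(
    prompt_tokens=1000,
    response_tokens=100,
    input_price_per_1k=0.001,
    output_price_per_1k=0.004,
)

# The cost scales linearly with the number of metrics and records.
num_metrics = 3
num_records = 10
total_cost = per_metric * num_metrics * num_records
print(f"Per metric: ${per_metric:.6f}; total: ${total_cost:.6f}")
```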

This detailed understanding of evaluation costs helps build a complete economic picture when comparing different model options and planning for ongoing quality assessment in production.

> **Workshop Note**: When designing your evaluation strategy, consider the trade-off between comprehensive evaluation (using many metrics) and cost efficiency. For routine evaluations, you might select a smaller subset of critical metrics to control costs.


In [6]:
evaluator_id = "amazon.nova-pro-v1:0"

evaluator_input_price = calculator.model_input_token_prices.get(evaluator_id)
evaluator_output_price = calculator.model_output_token_prices.get(evaluator_id)

print(f"Evaluator model input price: ${evaluator_input_price}; output price: ${evaluator_output_price}")

Evaluator model input price: $0.0008; output price: $0.0002


In [7]:
### Double-check your evaluation metrics. By default, this workshop evaluates 3 metrics: "Builtin.Correctness", "Builtin.Completeness", "Builtin.ProfessionalStyleAndTone"


# Find all matching json files
quality_evaluation_inputs = glob.glob(os.path.join(directory, "quality_evaluation*"))

# Process each file
total_quality_evaluation_cost = 0

for filename in quality_evaluation_inputs:
    #print(filename)    

    evaluation_price = 0
    with open(filename, 'r') as f:
        for line in f:
            data = json.loads(line)
            
            prompt_length = len(data["prompt"])
            reference_length = len(data["referenceResponse"])
            model_response_length = len(data["modelResponses"][0]["response"])
            
            # Estimate tokens according to the formula above (about 4 characters per token)
            input_tokens = (prompt_length + reference_length + model_response_length) / 4 + 300  # 300 is a rough estimate of the judge prompt size
            output_tokens = 20  # rough estimate of the evaluator output per metric
            
            # Calculate price for this entry
            entry_price = (input_tokens/1000 * evaluator_input_price) + (output_tokens/1000 * evaluator_output_price)
            evaluation_price += entry_price
            
            #print(f"Entry: Input tokens = {input_tokens:.2f}, Output tokens = {output_tokens}")
    
    print(f"Estimated cost per metric to evaluate {filename}: ${evaluation_price:.6f}\n")
    total_quality_evaluation_cost +=  evaluation_price


## We evaluated 3 metrics: "Builtin.Correctness", "Builtin.Completeness","Builtin.ProfessionalStyleAndTone"
total_quality_evaluation_cost = total_quality_evaluation_cost * 3

print(f"Estimated total cost to evaluate: ${total_quality_evaluation_cost:.6f}")
    





Estimated cost per metric to evaluate ../outputs/quality_evaluation.source_model.jsonl: $0.007137

Estimated cost per metric to evaluate ../outputs/quality_evaluation.us.anthropic.claude-3-5-haiku-20241022-v1:0.jsonl: $0.013481

Estimated cost per metric to evaluate ../outputs/quality_evaluation.amazon.nova-lite-v1:0.jsonl: $0.013224

Estimated total cost to evaluate: $0.101527


## Consolidating Latency and Inference Cost Metrics

In this section, we'll import and process the latency data collected during Step 3. This data is crucial for understanding the performance characteristics of each model, particularly for applications with real-time requirements or interactive user experiences.

By comparing these metrics across models, we can identify performance differences that might impact user experience in production environments. This is especially important for applications with strict latency requirements or those serving large numbers of concurrent users.
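
The summary statistics used for this comparison (mean, median, 90th percentile, standard deviation) can be computed with a pandas group-by. The sketch below uses pandas named aggregation on made-up latency samples; the model name and values are hypothetical, and real values come from the Step 3 output files.

```python
import pandas as pd

# Hypothetical latency samples in seconds, for illustration only.
samples = pd.DataFrame({
    "model": ["model_a"] * 11,
    "latency": [float(v) for v in range(1, 12)],
})

# Named aggregation produces flat, readable column names directly.
summary = samples.groupby("model").agg(
    sample_size=("latency", "count"),
    latency_mean=("latency", "mean"),
    latency_p50=("latency", "median"),
    latency_p90=("latency", lambda s: s.quantile(0.9)),
    latency_std=("latency", "std"),
).reset_index()

print(summary)
```

Reporting the p90 alongside the mean matters because a model with an acceptable average can still have a slow tail that users notice.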


In [8]:
def combine_latency_evaluation_files(directory):
    """Combine all CSV files in the directory into a single DataFrame."""
    all_files = glob.glob(os.path.join(directory, "document_summarization_*.csv"))
    df_list = []
    for filename in all_files:
        df = pd.read_csv(filename)
        df_list.append(df)
    return pd.concat(df_list, axis=0, ignore_index=True)

def calculate_metrics(df, group_columns):
    """Calculate latency metrics grouped by model, region, and inference profile."""
    metrics = df.groupby(group_columns).agg({
        'model_input_tokens': ['count', 'mean'],
        'model_output_tokens': ['mean'],
        'cost': ['mean'],
        'latency': ['mean', 'median', 
                             lambda x: x.quantile(0.9),lambda x: x.std()],
        'model_latencyMs': ['mean', 'median', 
                             lambda x: x.quantile(0.9),lambda x: x.std()]
    }).round(6)

    metrics.columns = ['sample_size', 
                      'avg_input_tokens',
                      'avg_output_tokens',
                       'avg_cost',
                       'latency_mean', 'latency_p50', 'latency_p90', 'latency_std',
                      'model_latencyMs_mean', 'model_latencyMs_p50', 'model_latencyMs_p90', 'model_latencyMs_std']
    
    metrics = metrics.reset_index()
    #print(metrics)
    
    ### add the quality metrics
    for model in metrics["model"]:
        print(model)
    return metrics


In [9]:

latency_evaluation_raw = combine_latency_evaluation_files(directory)
metrics = calculate_metrics(latency_evaluation_raw, ['model', 'region', 'inference_profile'])

display(metrics)

amazon.nova-lite-v1:0
source_model
us.anthropic.claude-3-5-haiku-20241022-v1:0


Unnamed: 0,model,region,inference_profile,sample_size,avg_input_tokens,avg_output_tokens,avg_cost,latency_mean,latency_p50,latency_p90,latency_std,model_latencyMs_mean,model_latencyMs_p50,model_latencyMs_p90,model_latencyMs_std
0,amazon.nova-lite-v1:0,us-east-1,standard,10,1127.4,43.8,7.8e-05,0.49,0.485,0.64,0.122384,479.5,477.5,626.6,123.810114
1,source_model,us-east-1,standard,73,656.438356,90.493151,0.000436,1.65863,1.55,2.178,0.497715,1518.643836,1469.0,2031.4,418.496328
2,us.anthropic.claude-3-5-haiku-20241022-v1:0,us-east-1,standard,10,1277.6,73.0,0.001314,2.496,2.405,3.151,0.681929,2486.7,2396.5,3148.1,682.633308


## Integrating Quality Metrics

With our latency and cost metrics prepared, we now need to incorporate the quality evaluation results from our LLM-as-a-Judge jobs. Before we can analyze these results, we need to:
1. Verify that all evaluation jobs have completed
2. Locate the output files in our S3 bucket
3. Extract and parse the evaluation scores for each model

This quality data completes our three-dimensional view of model performance (latency, cost, and quality), enabling truly informed selection decisions.
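
The parsing step can be sketched without any AWS calls. The example below assumes each line of the evaluation output is a JSON record with an `automatedEvaluationResult.scores` list of `metricName`/`result` pairs, which is the structure this notebook reads; the sample records and the helper name `average_scores` are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical two-record sample mimicking the evaluation output structure.
sample_jsonl = "\n".join([
    json.dumps({"automatedEvaluationResult": {"scores": [
        {"metricName": "Builtin.Correctness", "result": 1.0},
        {"metricName": "Builtin.Completeness", "result": 0.75},
    ]}}),
    json.dumps({"automatedEvaluationResult": {"scores": [
        {"metricName": "Builtin.Correctness", "result": 0.5},
        {"metricName": "Builtin.Completeness", "result": 1.0},
    ]}}),
])


def average_scores(jsonl_text):
    """Return the mean score per metric across all records in a JSONL string."""
    scores = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        result = record.get("automatedEvaluationResult", {})
        for score in result.get("scores", []):
            if "result" in score:  # ignore metrics that produced no score
                scores[score["metricName"]].append(score["result"])
    return {name: sum(values) / len(values) for name, values in scores.items()}


print(average_scores(sample_jsonl))
```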

> <span style="color:red">**Workshop Note**: Evaluation jobs may take several minutes to complete. If your jobs are still running, you can monitor their status in the AWS console or wait for completion before proceeding.</span>


In [10]:
bedrock_client = boto3.client('bedrock')
s3_client = boto3.client('s3')

In [12]:
import re
from collections import defaultdict

def get_s3_output_keys(evaluation_tracking):
    # Initialize result structure
    result = {
        "model": [],
        "key": []
    }
    
    for index, evaluation in evaluation_tracking.iterrows():
        model_id = evaluation['model']
        evaluation_job_arn = evaluation['quality_evaluation_jobArn']

        # Check job status
        check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn)
        print(f"{model_id}: {check_status['status']}")
        
        if check_status['status'] == "Completed":
            output_path = evaluation['quality_evaluation_output']
            try:
                response = s3_client.list_objects_v2(
                    Bucket=BUCKET_NAME,
                    Prefix=PREFIX
                )

                # Find the JSONL output file for this model
                for obj in response.get('Contents', []):
                    key = obj['Key']
                    # Add model identifier check
                    if key.endswith('_output.jsonl') and model_id.replace(':', '-') in key:
                        result["model"].append(model_id)
                        result["key"].append(key)
                        break
            
            except Exception as e:
                print(f"Error listing objects for {model_id}: {str(e)}")
        else:
            print("\x1b[31mQuality evaluation job is still in progress, please wait..\x1b[0m")
            sys.exit()
    
    return result

# Usage
s3_output_keys = get_s3_output_keys(evaluation_tracking)

# s3_output_keys holds two parallel lists: "model" (model IDs) and "key" (their S3 output keys)
print("\nAutomatically detected S3 output keys:")
for field, values in s3_output_keys.items():
    print(f"  {field}: {values}")

source_model: Completed
amazon.nova-lite-v1:0: Completed
us.anthropic.claude-3-5-haiku-20241022-v1:0: Completed

Automatically detected S3 output keys:
  model: ['source_model', 'amazon.nova-lite-v1:0', 'us.anthropic.claude-3-5-haiku-20241022-v1:0']
  key: ['genai_migration/llmaaj-source-model-amazon-2025-06-24-00-36-39/bcly6857m113/models/source_model/taskTypes/General/datasets/CustomDataset/30cc6f66-297c-4afd-afe2-b2864f3a1eb8_output.jsonl', 'genai_migration/llmaaj-amazon-amazon-2025-06-24-00-36-41/3zkkftou112q/models/amazon.nova-lite-v1-0/taskTypes/General/datasets/CustomDataset/422f0852-53f3-4bea-aa1e-a655f633a8e5_output.jsonl', 'genai_migration/llmaaj-us-amazon-2025-06-24-00-36-43/z33p2of2ey1o/models/us.anthropic.claude-3-5-haiku-20241022-v1-0/taskTypes/General/datasets/CustomDataset/90e82d7a-95c4-4b5c-a35d-c1d52ce2a692_output.jsonl']


In [13]:
### quality metrics

file_key_df = pd.DataFrame(s3_output_keys)
model_quality_list = []

# print(file_key_df)

for index, row in file_key_df.iterrows():  
    metrics_dict = {}

    model = row["model"]
    if row["key"] == "":
        continue
    response = s3_client.get_object(Bucket=BUCKET_NAME, Key=row["key"])
    content = response['Body'].read().decode('utf-8')
    
    for line in content.strip().split('\n'):
        if line:
            data = json.loads(line)
            if 'automatedEvaluationResult' in data and 'scores' in data['automatedEvaluationResult']:
                for score in data['automatedEvaluationResult']['scores']:
                    metric_name = score['metricName']
                    if 'result' in score:
                        metric_value = score['result']
                        if metric_name not in metrics_dict:
                            metrics_dict[metric_name] = []
                        metrics_dict[metric_name].append(metric_value)
    
    df = pd.DataFrame(metrics_dict)
    df['model'] = model
    model_quality_average = df.groupby("model").mean()
    model_quality_average = model_quality_average.reset_index()
    model_quality_list.append(model_quality_average)


model_quality = pd.concat(model_quality_list, axis=0, ignore_index=True)
print(model_quality)

                                         model  Builtin.Correctness  \
0                                 source_model                 0.95   
1                        amazon.nova-lite-v1:0                 1.00   
2  us.anthropic.claude-3-5-haiku-20241022-v1:0                 1.00   

   Builtin.Completeness  Builtin.ProfessionalStyleAndTone  
0                 0.925                               1.0  
1                 0.900                               1.0  
2                 1.000                               1.0  


> **Workshop Note**: These evaluation results may not differentiate clearly between the models because, to save time, we used a very small sample size (n=10). In a real-world implementation, increase the sample size to measure model quality reliably.

## Consolidated Metrics Analysis

### Combining Performance, Quality, and Cost Data

Once our quality metrics are merged with our performance and cost data, we have a complete view of each model's capabilities. This consolidated metrics framework lets us:

1. **Compare trade-offs**: See how models balance speed, quality, and cost
2. **Identify strengths**: Determine which models excel in specific dimensions
3. **Match to requirements**: Align capabilities with application-specific priorities

This multidimensional analysis is essential for making informed decisions that go beyond simplistic "best model" thinking. Different applications have unique requirements - a customer service chatbot might prioritize response quality and tone, while a high-volume processing application might favor speed and cost efficiency.

The merged metrics DataFrame serves as the foundation for our visualizations and final report, providing a clear picture of the relative advantages of each model option.
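
One detail worth checking when combining tables: `pd.merge` performs an inner join by default, so a model with no quality scores would silently drop out of the result. The sketch below, using hypothetical model names and values, shows one way to surface such gaps with a left join and the `indicator` option.

```python
import pandas as pd

# Hypothetical inputs: three models have cost data, only two have quality scores.
latency_cost = pd.DataFrame({
    "model": ["model_a", "model_b", "model_c"],
    "avg_cost": [0.0001, 0.0004, 0.0013],
})
quality = pd.DataFrame({
    "model": ["model_a", "model_b"],
    "Builtin.Correctness": [1.0, 0.95],
})

# how="left" keeps every model; validate guards against duplicate model rows;
# indicator=True adds a "_merge" column recording where each row came from.
merged = pd.merge(
    latency_cost, quality,
    on="model", how="left", validate="one_to_one", indicator=True,
)

missing_quality = merged.loc[merged["_merge"] == "left_only", "model"].tolist()
print(f"Models without quality scores: {missing_quality}")
```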


In [14]:
# Merge the quality scores into the metrics DataFrame
metrics = pd.merge(metrics, model_quality, on=['model'])
display(metrics)

Unnamed: 0,model,region,inference_profile,sample_size,avg_input_tokens,avg_output_tokens,avg_cost,latency_mean,latency_p50,latency_p90,latency_std,model_latencyMs_mean,model_latencyMs_p50,model_latencyMs_p90,model_latencyMs_std,Builtin.Correctness,Builtin.Completeness,Builtin.ProfessionalStyleAndTone
0,amazon.nova-lite-v1:0,us-east-1,standard,10,1127.4,43.8,7.8e-05,0.49,0.485,0.64,0.122384,479.5,477.5,626.6,123.810114,1.0,0.9,1.0
1,source_model,us-east-1,standard,73,656.438356,90.493151,0.000436,1.65863,1.55,2.178,0.497715,1518.643836,1469.0,2031.4,418.496328,0.95,0.925,1.0
2,us.anthropic.claude-3-5-haiku-20241022-v1:0,us-east-1,standard,10,1277.6,73.0,0.001314,2.496,2.405,3.151,0.681929,2486.7,2396.5,3148.1,682.633308,1.0,1.0,1.0


## AI-Enhanced Analysis Summary

### Using LLMs to Interpret Evaluation Results

One of the powerful capabilities of advanced LLMs is their ability to analyze complex data and generate insightful summaries. In this section, we'll leverage this capability by asking Claude 3.5 Haiku to interpret our evaluation results and provide a concise summary of the key findings.

This approach demonstrates an important pattern for working with foundation models: using them not just as content generators but as analytical tools that can extract insights from structured data. 

The resulting summary provides a human-readable interpretation of our metrics, complementing our visualizations and raw data with narrative insights.

> **Workshop Note**: This pattern of "LLM as data analyst" can be applied to many business intelligence scenarios beyond model evaluation. Consider how your organization might leverage foundation models to generate insights from other complex datasets.
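
When applying this pattern, how the table is serialized into the prompt matters: the default string form of a wide DataFrame may be wrapped or truncated depending on pandas display settings, whereas CSV keeps all columns on a single header row. The following is a minimal sketch with a hypothetical template and made-up values; it only builds the prompt text and does not call a model.

```python
import pandas as pd

# Hypothetical metrics for illustration only; not results from this workshop.
metrics_example = pd.DataFrame({
    "model": ["model_a", "model_b"],
    "avg_cost": [0.0001, 0.0013],
    "latency_mean": [0.5, 2.5],
})

prompt_template = (
    "Identify the fastest and the cheapest model in the dataset below.\n\n"
    "Dataset (CSV):\n{dataset}"
)

# to_csv(index=False) returns a string with one header row and one row per model.
prompt = prompt_template.format(dataset=metrics_example.to_csv(index=False))
print(prompt)
```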


In [15]:

client = boto3.client("bedrock-runtime")


In [17]:
## generate an analysis summary

summary_prompt = """Using the dataset provided below, create a concise 3-sentence summary that identifies which model performs best in each of these three metric categories:
1. Latency metrics (latency_mean, latency_p50, latency_p90)
2. Cost metrics (avg_cost)
3. Quality metrics (Builtin.Correctness and output completeness as suggested by avg_output_tokens)
4. Don't start with  "Based on the provided...", just give your summary
The summary should clearly state which model is optimal for users prioritizing speed, cost-efficiency, or output quality. 

Dataset: {dataset}
"""

max_tokens = 1000
temperature = 0
top_p = 0.9

response = client.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": summary_prompt.format(dataset=metrics)
                }
            ]
        }
    ],
    inferenceConfig={
        "temperature": temperature,
        "maxTokens": max_tokens,
        "topP": top_p
    }
)

analysis_summary = response['output']['message']['content'][0]['text']

print(analysis_summary)

For latency metrics, Amazon Nova Lite emerges as the fastest model with the lowest mean latency (0.49s), p50 (0.485s), and p90 (0.640s) times. In terms of cost-efficiency, Amazon Nova Lite offers the most economical solution with the lowest average cost of $0.000078 per inference. Regarding output quality, both Amazon Nova Lite and Claude 3.5 Haiku achieve perfect correctness (1.00), but Claude 3.5 Haiku demonstrates superior output completeness with a 1.000 score compared to Nova Lite's 0.900.


## Final Report Generation

### Creating Comprehensive Analysis Documentation

The culmination of our evaluation process is a professionally formatted report that presents our findings in a clear, visually appealing format. This report serves multiple purposes:

1. **Documentation**: Creates a permanent record of our evaluation methodology and results
2. **Communication**: Provides shareable artifacts for stakeholder discussions
3. **Decision Support**: Organizes information to facilitate informed choices

Our `generate_analysis_report` utility handles the complex work of:
- Creating consistent visualizations across dimensions
- Formatting tables for readability
- Generating performance distribution charts
- Including our AI-generated summary alongside raw metrics

The final output includes both a PDF report and a CSV summary, providing options for both high-level reviews and detailed analysis.

> **Workshop Note**: After running this cell, check the `../outputs-analysis/` directory to view your generated report. This report can serve as a template for your own model evaluation documentation.


In [18]:
report = generate_analysis_report.Analysis_Report()
report.generate_report(latency_evaluation_raw, directory, metrics, analysis_summary, total_quality_evaluation_cost, prompt_optimization_cost)


['amazon.nova-lite-v1:0' 'source_model'
 'us.anthropic.claude-3-5-haiku-20241022-v1:0']

Analysis complete!
PDF report saved to: ../outputs-analysis/Model_evaluation_analysis_report_20250624_013541.pdf
CSV summary saved to: ../outputs-analysis/analysis_summary_20250624_013541.csv


## Summary and Takeaways

### Congratulations on Completing the Model Evaluation Workshop!

You've successfully navigated the entire model evaluation and migration process, gaining hands-on experience with a methodology that can be applied to your own GenAI projects. In this final notebook, you've:

1. ✅ **Consolidated multi-dimensional metrics** across latency, quality, and cost
2. ✅ **Calculated economic implications** of different model choices at scale  
3. ✅ **Analyzed quality assessments** from LLM-as-a-Judge evaluations
4. ✅ **Generated AI-enhanced insights** to interpret complex evaluation data
5. ✅ **Created professional documentation** to support decision-making

### Key Takeaways

1. **Model Selection is Multi-Dimensional**: There's rarely a single "best" model - different models excel in different areas, and the optimal choice depends on your specific requirements.
2. **Data-Driven Migration**: Successful model migrations require objective measurements rather than assumptions or specifications alone.
3. **Trade-off Analysis**: Understanding the relationship between speed, quality, and cost allows for informed decisions that balance competing priorities.
4. **Documentation Matters**: Thorough documentation of your evaluation methodology and results helps build consensus and supports future migration decisions.
5. **Continuous Evaluation**: As new models are released and existing ones are updated, the evaluation process should be repeated periodically to ensure you're using optimal components.

### Next Steps

Consider applying this evaluation framework to your own use cases:
- **Customize success criteria**: Define specific thresholds for acceptance based on your application requirements
- **Widen the comparison**: Evaluate more models across different providers and architectures
- **Expand the metrics**: Add domain-specific evaluation criteria relevant to your applications
- **Continue prompt optimization**: Remember that prompt engineering is an ongoing effort - consider building a systematic pipeline for testing prompt variations with automated metrics or human-in-the-loop evaluation
- **Explore more inference options**: Explore the inference optimization options in Bedrock that may improve latency for your use cases, such as [latency-optimized inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html), [prompt caching](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html), and [inference profiles](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles.html).
- **Build a RAG evaluation component**: If your use case relies on a knowledge base built with RAG, consider adding a quality evaluation component for it. Options include the [Bedrock RAG evaluator](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-kb.html) and open-source solutions such as [RAGAS](https://docs.ragas.io/en/stable/getstarted/rag_eval/).
- **Automate the workflow**: Integrate these evaluation steps into your CI/CD pipeline for ongoing model assessment
- **Establish feedback loops**: Create mechanisms to capture production performance data to inform future prompt and model optimizations

Thank you for participating in this workshop! We hope the skills and methodology you've learned will help you make confident, data-driven decisions about model selection and migration in your GenAI journey.
