# Step 1 - Prepare Data and Define Success Criteria

## Introduction

In this initial phase, we'll establish the foundation for our model migration journey. Proper preparation is crucial for successful model transitions. We need to clearly define our business case, understand current performance, and establish measurable success criteria.

### Business Context

For this workshop, we're working with the following scenario:

- **Business Case:** A customer is using a proprietary LLM (our "source model") for document summarization, which generates concise abstracts from longer texts.
- **Pain Points:** The source model has become expensive to operate at scale, putting pressure on the operating budget.
- **Current Performance:** The model provides good quality summaries but at a high cost per inference.

### Success Criteria

For our migration to be considered successful, we need to meet or exceed these criteria:

| Dimension | Success Criteria | Measurement Method |
|-----------|------------------|-------------------|
| Quality | Similar or better accuracy in summarization | LLM-as-judge evaluation comparing summary quality |
| Latency | Similar or better latency | Direct measurement of response times |
| Cost | Lower overall cost per inference | Calculation based on token counts and pricing |
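
The criteria in the table can be written down as an explicit check. The following is a minimal sketch, not the workshop's own implementation: the `ModelMetrics` class, the `meets_success_criteria` function, and the 5% `tolerance` used to interpret "similar" are all assumptions made for illustration, and the numbers are placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class ModelMetrics:
    """Aggregate metrics for one model (field names are illustrative)."""
    quality_score: float       # e.g. mean LLM-as-judge score, higher is better
    latency_seconds: float     # mean response time, lower is better
    cost_per_inference: float  # USD per request, lower is better


def meets_success_criteria(source: ModelMetrics,
                           candidate: ModelMetrics,
                           tolerance: float = 0.05) -> dict:
    """Compare a candidate against the source model on each dimension.

    `tolerance` is an assumed definition of "similar": the candidate may be
    up to 5% worse on quality or latency and still pass.
    """
    return {
        "quality": candidate.quality_score >= source.quality_score * (1 - tolerance),
        "latency": candidate.latency_seconds <= source.latency_seconds * (1 + tolerance),
        "cost": candidate.cost_per_inference < source.cost_per_inference,
    }


# Placeholder numbers purely for illustration, not measured results.
source = ModelMetrics(quality_score=4.0, latency_seconds=2.0, cost_per_inference=0.010)
candidate = ModelMetrics(quality_score=3.9, latency_seconds=1.5, cost_per_inference=0.004)
result = meets_success_criteria(source, candidate)
print(result)
print("Migration criteria met:", all(result.values()))
```

Agreeing on the tolerance before any candidate is evaluated keeps the pass/fail decision from being adjusted after the results are known.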

Let's begin by exploring our dataset and analyzing the source model's performance!

In [2]:
pip install --upgrade numexpr

Looking in indexes: https://pypi.org/simple, https://plugin.us-east-1.prod.workshops.aws
Collecting numexpr
  Downloading numexpr-2.11.0.tar.gz (108 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: numexpr
  Building wheel for numexpr (pyproject.toml) ... done
  Created wheel for numexpr: filename=numexpr-2.11.0-cp310-cp310-linux_x86_64.whl size=149340 sha256=1b11388fc2f7cde048a4b8124c95df360a3a038b5a88bd93fb8ef7a0fc52b673
  Stored in directory: /home/ec2-user/.cache/pip/wheels/a7/d0/17/e38daa1110f54ba5f7330d38440f592c063251a6456053e2ed
Successfully built numexpr
Installing collected packages: numexpr
  Attempting uninstall: numexpr
    Found existing installation: numexpr 2.7.3
    Uninstalling numexpr-2.7.3:
      Successfully uninstalled numexpr-2.7.3
Successfully installed numexpr-2.11.0
Note: you may need to res

## Dataset Preparation

### About the Dataset

A high-quality evaluation dataset is critical for migration success. For this workshop, we're using the [EdinburghNLP/xsum](https://huggingface.co/datasets/EdinburghNLP/xsum) dataset, a well-established benchmark for text summarization that contains news articles paired with human-written summaries.

We've selected 10 representative samples from this dataset to keep the workshop computationally efficient. These samples span various document lengths and complexity levels, which gives reasonable coverage despite the small sample size.

> **Note for Self-Paced Learners:** If you're running this workshop in your own environment, you're welcome to substitute your own domain-specific dataset. Just make sure it follows the same CSV format with `document` and `referenceResponse` columns. If you're interested in how we preprocessed the dataset, check the `src/dataset.py` file.
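
If you do bring your own dataset, a quick check of the expected columns can catch format problems early. This is a rough sketch under stated assumptions: `validate_dataset` is a hypothetical helper (it is not part of `src/dataset.py`), and a small in-memory CSV stands in for your file.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = {"document", "referenceResponse"}


def validate_dataset(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means the data looks usable."""
    problems = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems
    for column in sorted(REQUIRED_COLUMNS):
        # Treat NaN and whitespace-only strings as empty.
        empty = df[column].isna() | (df[column].astype(str).str.strip() == "")
        if empty.any():
            problems.append(f"{int(empty.sum())} empty value(s) in '{column}'")
    return problems


# With a real file you would call pd.read_csv on its path instead.
sample_csv = io.StringIO(
    "document,referenceResponse\n"
    '"A long article about a factory sale.","Factory sold to a local firm."\n'
    '"Another article.",\n'
)
dataset = pd.read_csv(sample_csv)
problems = validate_dataset(dataset)
print(problems)
```

Here the second row has no reference summary, so the check reports one empty value in `referenceResponse`.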

Let's examine a sample to understand what our data looks like:

In [None]:
# check test dataset
! head -5 ../data/document_sample_10.csv

## Source Model Analysis

### Baseline Performance Measurement

> **Note:** In a real migration scenario, you would collect these metrics from your production system. If you're building a new application rather than migrating, you might not have a source model to compare against.

A critical step in the migration process is establishing a reliable performance baseline for your source model. This baseline will serve as the comparison point for all candidate models.

For this workshop, we've already generated responses from our hypothetical "source model" using the dataset.



> **For Self-Paced Learners:** If you wish to use your own source model (e.g., OpenAI's GPT-4o-mini or another model), you can generate responses following the same format and replace the provided CSV file.
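
One way to produce such a file is to time each call and record the token counts alongside the prompt. The sketch below is illustrative only: `call_source_model` is a hypothetical stand-in that runs offline and must be replaced with your provider's client, and `model_response` is an assumed column name. The columns `model`, `invocation_id`, `prompt`, `latency`, `model_input_tokens`, and `model_output_tokens` match those used in this notebook, but check the provided CSV for the full set.

```python
import time

import pandas as pd


def call_source_model(prompt: str) -> dict:
    """Hypothetical stand-in for your own model client.

    Replace the body with a real API call that returns the generated text
    and the token counts reported by your provider.
    """
    summary = prompt[:40]  # dummy "summary" so the sketch runs offline
    return {
        "text": summary,
        "input_tokens": len(prompt.split()),    # crude word count, not real tokens
        "output_tokens": len(summary.split()),
    }


def record_invocation(invocation_id: int, prompt: str) -> dict:
    """Time one model call and return a row for the responses file."""
    start = time.perf_counter()
    response = call_source_model(prompt)
    latency = time.perf_counter() - start  # seconds
    return {
        "model": "source_model",
        "invocation_id": invocation_id,
        "prompt": prompt,
        "model_response": response["text"],  # assumed column name
        "latency": latency,
        "model_input_tokens": response["input_tokens"],
        "model_output_tokens": response["output_tokens"],
    }


prompts = ["First, please read the article below. ...", "Summarize this text. ..."]
rows = [record_invocation(i, p) for i, p in enumerate(prompts)]
responses_df = pd.DataFrame(rows)
print(responses_df[["invocation_id", "latency", "model_input_tokens"]])
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock, which makes it better suited to measuring elapsed time than `time.time()`.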

Let's examine one of the source model's responses to understand its performance characteristics:


In [3]:
# check responses from source model 

import pandas as pd

# Read the CSV file
csv_path = '../outputs/document_summarization_source_model.csv'
df = pd.read_csv(csv_path)

# Get and print the first row nicely
first_row = df.iloc[0]
print("\nFirst row contents:")
print("-" * 80)
for column, value in first_row.items():
    print(f"{column}: {value}")
print("-" * 80)


First row contents:
--------------------------------------------------------------------------------
model: source_model
region: us-east-1
invocation_id: 0
prompt: 
First, please read the article below.
BAE Systems blamed the closure plan for the former Vickers plant in Scotswood on fewer Ministry of Defence orders.
Now neighbouring defence and construction giant Reece Group has signed a deal with BAE to take on the factory for an undisclosed sum.
Reece plans to transfer work from existing sites into Scotswood.
Reece already employs about 500 people and hopes to expand the workforce at its new premises over time.
The Group includes Pearson Engineering, which designs and manufactures a range of military equipment for customers, including BAE Systems. It also produces equipment for oil and gas fields.
Chairman John Reece said: "This landmark site will be fully used as a manufacturing facility. We have also acquired plant and machinery no longer required by BAE Systems which will be used

In [4]:
# Get summary statistics for the source model's responses.
# This file contains only the inference responses; quality and cost are measured in later steps.

source_df = pd.read_csv('../outputs/document_summarization_source_model.csv')
print("\nSource Model Performance Summary:")
print(f"Average latency: {source_df['latency'].mean():.2f} seconds")
print(f"Average input tokens: {source_df['model_input_tokens'].mean():.0f}")
print(f"Average output tokens: {source_df['model_output_tokens'].mean():.0f}")


Source Model Performance Summary:
Average latency: 1.66 seconds
Average input tokens: 656
Average output tokens: 90
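
These token averages are the inputs to the cost criterion. As a rough sketch of the arithmetic, assuming per-1,000-token pricing: the `cost_per_inference` helper is hypothetical, and the two prices are placeholders, not real pricing for any model.

```python
def cost_per_inference(input_tokens: float, output_tokens: float,
                       price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request in USD, given per-1,000-token prices."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


# Averages reported above for the source model.
avg_input_tokens = 656
avg_output_tokens = 90

# Placeholder prices, NOT real pricing for any model. Look up current rates.
PRICE_IN_PER_1K = 0.001
PRICE_OUT_PER_1K = 0.002

cost = cost_per_inference(avg_input_tokens, avg_output_tokens,
                          PRICE_IN_PER_1K, PRICE_OUT_PER_1K)
print(f"Estimated cost per inference: ${cost:.6f}")
print(f"Estimated cost per 1M inferences: ${cost * 1_000_000:,.2f}")
```

Because input tokens outnumber output tokens by roughly seven to one in this dataset, the input price has a large influence on the total even when output tokens are priced higher.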


## Candidate Model Selection

### Selection Strategy

Selecting the right models for evaluation is crucial for a successful migration. We should consider several factors including:

- Input and output modalities
- Context window size
- Cost per inference/token
- Performance capabilities
- Domain specialization

Our source model is a small foundation model optimized for document summarization. Based on our analysis of the models available in Amazon Bedrock, we've selected two promising candidates:

1. **Amazon Nova Lite (amazon.nova-lite-v1:0)**
   - A lightweight, cost-efficient model
   - Designed for efficient text processing
   - Lower token pricing compared to the source model
   - Well-suited for concise summarization tasks

2. **Meta's Llama 3.2 11B (us.meta.llama3-2-11b-instruct-v1:0)**
   - Open-weights model with strong performance
   - Enhanced instruction following capabilities
   - Competitive token pricing
   - Available through cross-region inference

> **Note:** For a comprehensive comparison of Bedrock models and their capabilities, refer to the [Amazon Bedrock model documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).

In a real migration scenario, you might test more models or different variants based on your specific requirements and constraints.
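
When the list of possible models is long, the hard constraints among the factors above (modality, context window, price) can be applied as a simple filter before any evaluation is run. The following is only a sketch: the `Candidate` class and `shortlist` function are hypothetical, and the catalog names and numbers are placeholders rather than real model specifications.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    input_modalities: set
    context_window: int      # tokens
    price_in_per_1k: float   # USD per 1,000 input tokens
    price_out_per_1k: float  # USD per 1,000 output tokens


def shortlist(candidates, required_modality, min_context, max_price_in, max_price_out):
    """Keep candidates that satisfy the hard constraints, cheapest input price first."""
    kept = [
        c for c in candidates
        if required_modality in c.input_modalities
        and c.context_window >= min_context
        and c.price_in_per_1k <= max_price_in
        and c.price_out_per_1k <= max_price_out
    ]
    return sorted(kept, key=lambda c: c.price_in_per_1k)


# Hypothetical catalog: names and numbers are placeholders, not real model specs.
catalog = [
    Candidate("candidate-a", {"text"}, 128_000, 0.0005, 0.0020),
    Candidate("candidate-b", {"text", "image"}, 32_000, 0.0002, 0.0010),
    Candidate("candidate-c", {"text"}, 4_000, 0.0001, 0.0005),
    Candidate("candidate-d", {"image"}, 200_000, 0.0003, 0.0015),
]

selected = shortlist(catalog, required_modality="text", min_context=8_000,
                     max_price_in=0.001, max_price_out=0.002)
print([c.name for c in selected])
```

Performance capabilities and domain specialization cannot be filtered this way; they are what the evaluation itself is meant to measure.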


## Progress Tracking Setup

### Creating an Evaluation Framework

As we progress through this workshop, we'll evaluate multiple models across various dimensions. To maintain organization and enable comprehensive comparison at the end, we'll create a tracking system.

The tracking framework will help us maintain consistency and capture all relevant metrics.

We'll start by creating a simple dataframe with our source and candidate models. As we progress through the workshop, we'll add more metrics and evaluation results to this tracking system.

> **Note:** If you're re-running this notebook, we'll first clean up any existing tracking file to start fresh.


In [10]:
# Clean up the tracking file. This is necessary if you rerun the notebooks.

import os

file_path = "../data/evaluation_tracking.csv"

if os.path.exists(file_path):
    try:
        os.remove(file_path)
        print(f"File {file_path} has been deleted successfully")
    except OSError as e:
        print(f"Error occurred while deleting {file_path}: {e}")
else:
    print(f"File {file_path} does not exist. You can continue the workshop.")
    

File ../data/evaluation_tracking.csv has been deleted successfully


In [11]:
## Supported foundation models in Amazon Bedrock: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html

data = {
    'model': [
        'source_model',  # the baseline we evaluate against; DO NOT CHANGE THIS NAME
        'amazon.nova-lite-v1:0',
        # Alternative candidates (uncomment to include):
        # 'us.meta.llama3-2-11b-instruct-v1:0',  # Llama 3.2 11B is only available through cross-region inference. Read more: https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles-support.html#inference-profiles-support-system
        # 'anthropic.claude-3-haiku-20240307-v1:0',
        'us.anthropic.claude-3-5-haiku-20241022-v1:0',  # cross-region inference
    ]
}

# Create the DataFrame
evaluation_tracking = pd.DataFrame(data)

evaluation_tracking['model_clean_name'] = evaluation_tracking['model'].str.replace(":", "-")

print(evaluation_tracking)

# Save the DataFrame to a CSV file
evaluation_tracking_file = '../data/evaluation_tracking.csv'
evaluation_tracking.to_csv(evaluation_tracking_file, index=False)
print(f"\nDataFrame saved to {evaluation_tracking_file}")


                                         model  \
0                                 source_model   
1                        amazon.nova-lite-v1:0   
2  us.anthropic.claude-3-5-haiku-20241022-v1:0   

                              model_clean_name  
0                                 source_model  
1                        amazon.nova-lite-v1-0  
2  us.anthropic.claude-3-5-haiku-20241022-v1-0  

DataFrame saved to ../data/evaluation_tracking.csv
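
As later steps produce metrics, they can be added to this table as new columns. Here is a minimal sketch of one way to do that, assuming a hypothetical `record_metric` helper and a hypothetical `avg_latency` column name; the workshop's later notebooks may organize this differently. It works on an in-memory copy, so in practice you would read the tracking CSV first and write it back afterwards.

```python
import pandas as pd

# Same structure as the tracking DataFrame created above.
tracking = pd.DataFrame({
    "model": ["source_model", "amazon.nova-lite-v1:0"],
    "model_clean_name": ["source_model", "amazon.nova-lite-v1-0"],
})


def record_metric(tracking: pd.DataFrame, model: str, metric: str, value: float) -> pd.DataFrame:
    """Set one metric for one model, creating the column if needed."""
    if model not in set(tracking["model"]):
        raise KeyError(f"unknown model: {model}")
    updated = tracking.copy()
    if metric not in updated.columns:
        updated[metric] = float("nan")  # other models stay empty until measured
    updated.loc[updated["model"] == model, metric] = value
    return updated


# "avg_latency" is a hypothetical column name; 1.66 is the source model's
# average latency reported earlier in this notebook.
tracking = record_metric(tracking, "source_model", "avg_latency", 1.66)
print(tracking)
```

Rejecting unknown model names guards against typos silently producing a metric that is attached to no row.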


## Summary and Next Steps

### What We've Accomplished

In this notebook, we've completed the critical first phase of the LLM migration process:

1. ✅ **Defined our business case** and understood the limitations of our current solution
2. ✅ **Established clear success criteria** across quality, latency, and cost dimensions
3. ✅ **Examined our evaluation dataset** and its characteristics
4. ✅ **Analyzed our source model's performance** to establish a baseline
5. ✅ **Selected promising candidate models** for evaluation
6. ✅ **Created a tracking system** to organize our evaluation process

### Next Steps

Proceed to Step 2.