## Introduction

This notebook demonstrates how to set up an evaluation pipeline for GenAI-generated output, using Snorkel's Evaluation feature set.

The evaluation workflow has four main phases:

1. [Onboarding artifacts](#Phase-1:-Onboarding-artifacts)
2. [Creating the initial evaluation benchmark](#Phase-2:-Creating-the-initial-evaluation-benchmark)
3. [Refining the benchmark](#Phase-3:-Refining-the-benchmark)
4. [Improving your GenAI application](#Phase-4:-Improving-your-GenAI-application)

This notebook covers all four phases, using a small dataset of chatbot responses from an example GenAI chatbot application that answers questions about medical insurance for customers. For more about this example use case, read the [Evaluate GenAI output](https://docs.snorkel.ai/docs/0.95/user-guide/use-cases/genai-evaluation-notebook) tutorial in the Snorkel documentation.

After running this notebook, you'll have a functional end-to-end benchmarking pipeline that allows you to track progress against multiple criteria for sequential generations of the chatbot response data.

### Prerequisites

Have the following resources available before you run this notebook:

- A Snorkel Flow instance
- A Superadmin API key for Snorkel Flow
- Amazon SageMaker with authentication secrets (used only for the fine-tuning step, which you can skip)
- The name of the LLM you want to use for any LLM-based functionality
- The data files you will download below

### Upload files

Download these files from Snorkel. The four CSV files are the sample data, and the binary file is a model used to create a slice of Spanish-language data.

Upload all of the files to Snorkel Flow, in the same directory as this notebook.

- [eval-spanish.csv](https://snorkel-docs-downloads.s3.amazonaws.com/eval/eval-spanish.csv)
- [eval-train-1.csv](https://snorkel-docs-downloads.s3.amazonaws.com/eval/eval-train-1.csv)
- [eval-train-2.csv](https://snorkel-docs-downloads.s3.amazonaws.com/eval/eval-train-2.csv)
- [eval-valid.csv](https://snorkel-docs-downloads.s3.amazonaws.com/eval/eval-valid.csv)
- [eval-lid.176.bin](https://snorkel-docs-downloads.s3.amazonaws.com/eval/eval-lid.176.bin)
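
Before continuing, it can help to confirm that all five files are in the notebook's directory. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_files` is hypothetical and is not part of the Snorkel SDK.

```python
from pathlib import Path

# File names taken from the download list above
REQUIRED_FILES = [
    "eval-spanish.csv",
    "eval-train-1.csv",
    "eval-train-2.csv",
    "eval-valid.csv",
    "eval-lid.176.bin",
]


def find_missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


missing = find_missing_files()
if missing:
    print(f"Missing files: {missing}")
else:
    print("All required files are present.")
```

If any names are reported as missing, upload them before running the import cells.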

### User input

**<span style="color:red;">This section requires input from the user.</span>**

In [None]:
# Set the Snorkel workspace; the example uses "default"
workspace_uid = 1
workspace_name = "default"

This notebook creates many assets like datasets, apps, and slices. If you run asset creation cells multiple times, you will receive naming conflict errors because the asset was created on the first run. If you want to re-run this notebook, replace the `app_name` below with a new name and start from the beginning.
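
One way to avoid naming conflicts on repeated runs is to derive the app name from a base name plus a timestamp. This is a minimal sketch: `make_unique_app_name` is a hypothetical helper, not part of the Snorkel SDK, and it assumes that Snorkel Flow accepts hyphens and digits in asset names.

```python
from datetime import datetime, timezone


def make_unique_app_name(base_name: str) -> str:
    """Append a UTC timestamp so repeated runs do not reuse an existing asset name."""
    suffix = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{base_name}-{suffix}"


# Hypothetical usage: assign the result to app_name before creating any assets
print(make_unique_app_name("evaluation-example-app"))
```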

In [None]:
# Use your Superadmin Snorkel Flow API key
api_key = None  # replace this with your Snorkel Flow API key

# Define your app name
app_name = "evaluation-example-app"

# Pick which LLM you want to use for LLM-based functionality
model_name = "openai/gpt-4o-mini"

Add your Amazon SageMaker access details. This is needed for the demonstration of fine-tuning the model powering your GenAI application. If you want to skip the fine-tuning portion, the rest of the notebook will still function.

In [None]:
AWS_ACCESS_KEY_ID_SECRET = "YOUR_ID_SECRET"
AWS_SECRET_ACCESS_KEY_SECRET = "YOUR_KEY_SECRET"
SAGEMAKER_EXECUTION_ROLE_SECRET = "YOUR_ROLE_SECRET"

Once you've completed the user input section, you can run the rest of the notebook as-is.

### Set up Snorkel Flow app

Import the Snorkel SDK and other utilities.

In [None]:
import logging
import warnings

logging.basicConfig(
    level=logging.ERROR
)  # Set the root logger to only show error-level messages
warnings.filterwarnings(
    "ignore"
)  # use warnings.resetwarnings() if you want to display warnings for debugging purposes

In [None]:
# Imports
import snorkelflow.client as sf
import snorkelflow.sdk as sf_sdk
from snorkelflow.sdk.fine_tuning_app import FineTuningApp
from snorkelflow.types.finetuning import (
    AnnotationStrategy,
    FineTuningAppConfig,
    FineTuningColumnType,
)
from snorkelflow.types.source import ModelSourceMetadata
from snorkelflow.sdk import Dataset
from snorkelflow.sdk import ModelNode
from snorkelflow.sdk.slices import Slice
import pandas as pd

In [None]:
# Authenticate with Snorkel Flow
ctx = sf.SnorkelFlowContext.from_kwargs(
    api_key=api_key,
    workspace_name=workspace_name,
)

In [None]:
# Import the first train and valid data splits
df_t1 = pd.read_csv("./eval-train-1.csv")
df_v1 = pd.read_csv("./eval-valid.csv")
print(df_t1.columns)
print(df_v1.columns)

Expected output:

```
Index(['question', 'response', 'retrieved_context_clean', 'prompt_prefix'], dtype='object')
Index(['question', 'response', 'retrieved_context_clean', 'prompt_prefix'], dtype='object')
```

In [None]:
# Create a fine-tuning app config, map data fields, and create the new Snorkel Flow app
cm = {
    "question": "instruction",
    "response": "response",
    "prompt_prefix": "prompt_prefix",
    "retrieved_context_clean": "context",
}
app_config = FineTuningAppConfig(column_mappings=cm)

FineTuningApp.create(app_name, app_config)
ft_app = FineTuningApp.get(app_name)

Expected output:

```
Successfully created dataset evaluation-example-app with UID XXXX.
Successfully created label schema Response acceptance with UID XXXX.
Successfully created dataset view Single response view with uid XXXX
Successfully created fine tuning application evaluation-example-app with uid XXXX
```

This workflow introduces a `model_source` object, which maps data sources in the parent dataset to different experiments. For example, the initial data comes from an alpha experiment, and it has a different model source ID than the data you will onboard later from a beta experiment. This object distinguishes metrics across evaluations and determines which data a user is developing on.

In [None]:
# Define the base LLM provider
llama3_8B_v0 = "llama3-8B-v0"
experiment1 = ft_app.register_model_source(
    llama3_8B_v0, metadata=ModelSourceMetadata(model_name=llama3_8B_v0)
)["source_uid"]

# Import data to the app
ft_app.import_data(data=df_t1, split="train", source_uid=experiment1, sync=True)
ft_app.import_data(data=df_v1, split="valid", source_uid=experiment1, sync=True)

The data import can take 2-5 minutes to complete.

Expected output:

```
Successfully retrieved dataset evaluation-example-app with UID XXXX.
+0.59s Waiting for next available worker (position in queue: 1)

...

+68.02s (100%) Transactions committed.

'rq-sbU06y4a_engine-YUYK_prep-and-ingest-fine-tuning-data'
```
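
To illustrate why each imported row is tagged with a source UID, the sketch below groups rows by source and averages a score per experiment. It uses plain Python dictionaries with made-up values. The field names `source_uid` and `score` and the helper `mean_score_by_source` are assumptions for illustration, not Snorkel's internal representation.

```python
from collections import defaultdict


def mean_score_by_source(rows):
    """Average the `score` field separately for each `source_uid`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row["source_uid"]] += row["score"]
        counts[row["source_uid"]] += 1
    return {uid: sums[uid] / counts[uid] for uid in sums}


# Illustrative rows only: source 1 stands in for one experiment, source 2 for another
rows = [
    {"source_uid": 1, "score": 1},
    {"source_uid": 1, "score": 0},
    {"source_uid": 2, "score": 1},
    {"source_uid": 2, "score": 1},
]
print(mean_score_by_source(rows))
```

Because every row carries its source, the same metric can be reported per experiment without mixing responses from different model versions.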

In [None]:
# # Uncomment this cell if you want to import an augmented dataset
# augmented_df = sf.augment_dataset(
#     dataset=ft_app.dataset_uid,
#     x_uids=ft_app.get_dataframe("train").index.tolist()[:5],
#     model_name=model_name,
#     runs_per_prompt=2,
#     fields=["question", "response"],
#     temperature=1.5,
# )
#
# ft_app.import_data(
#     data=augmented_df,
#     split="train",
#     source_uid=ft_app.register_model_source("input_data_augmentation", metadata=ModelSourceMetadata(model_name="gpt4o"))['source_uid'],
#     sync=True
# )

# Phase 1: Onboarding artifacts

## Define evaluation criteria

Before you can evaluate the chatbot's response quality, you need to define what constitutes a high-quality response.

In this example, you define six individual criteria for a high-quality response:

- Completeness
- Correctness
- Polite Tone
- Retrieved Context Accuracy
- unit_test_contains_pii
- unit_test_mentions_competitors

The chatbot responses can be assessed against these criteria in multiple ways:

- Collect ground truth ratings from domain experts. For example, your support team could label responses as having a "Polite Tone" or not.
- Use Snorkel evaluators to assess responses programmatically. When you use evaluators, you can scale quality measurements and evaluate subsequent experiments with little involvement from your domain experts.

In [None]:
# Define a name and description for each criterion
criteria = [
    {
        "name": "Completeness",
        "description": "Completeness refers to the organization and thoroughness of the chatbot's response. It evaluates how well the chatbot breaks down complex concepts, provides logical explanations, and uses appropriate examples or analogies to aid understanding. Evaluators should assess whether the explanation covers all necessary information, details, and components needed to provide a thorough and satisfactory answer.",
    },
    {
        "name": "Correctness",
        "description": "Correctness measures the accuracy and relevance of the chatbot's response to the user's query. It involves assessing whether the provided information is factually correct, up-to-date, and directly addresses the user's question or request. Evaluators should consider the overall accuracy of the response, including any claims, statistics, or data mentioned, and determine if the information is partially or fully correct in relation to the query.",
    },
    {
        "name": "Polite Tone",
        "description": "Polite Tone assesses the chatbot's ability to maintain a respectful and courteous demeanor in its responses. Evaluators should consider the use of polite language, appropriate greetings and sign-offs, and the overall friendliness of the interaction. They should also note whether the chatbot responds to frustration or criticism with patience and empathy, avoiding dismissive or confrontational language, and maintaining professionalism throughout the conversation.",
    },
    {
        "name": "Retrieved Context Accuracy",
        "description": "This criterion evaluates how well the AI's response aligns with and correctly uses retrieved information. Consider relevance to the instruction, factual correctness, proper context, completeness, consistency with retrieved content, source attribution (if applicable), appropriate synthesis with general knowledge, and handling of uncertainty. Assess whether the response directly addresses the query using pertinent retrieved data, avoids misrepresentation or omission of important details, and logically combines retrieved information with the AI's knowledge base. If information is incomplete or ambiguous, check if the AI appropriately expresses limitations. Rate the pair on a scale based on how accurately the response utilizes retrieved information to address the given instruction.",
    },
    {
        "name": "unit_test_contains_pii",
        "description": "This criterion assesses whether the chatbot's response includes any information that could be used to identify a specific individual. Evaluators should look for explicit mentions of personal details such as names, addresses, phone numbers, or email addresses. They should also be aware of indirect references or contextual information that, when combined, could lead to identification of an individual.",
    },
    {
        "name": "unit_test_mentions_competitors",
        "description": "This criterion evaluates whether the chatbot's response mentions or discusses competing products, services, or companies, especially when not directly relevant to the user's query. Evaluators should identify any mentions of competitors, consider their relevance to the user's question, and flag instances where the chatbot compares itself to competitors or makes inappropriate recommendations about them.",
    },
]

Evaluation criteria are defined as dataset label schemas in Snorkel Flow.

In [None]:
# Create a dataset label schema for each evaluation criterion
d = sf_sdk.Dataset.get(app_name)
for schema in criteria:
    try:
        d.create_label_schema(
            name=schema["name"],
            data_type="text",
            task_type="classification",
            description=schema["description"],
            label_map=["ACCEPT", "REJECT"],
        )
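    except Exception as e:
        # Report which schema failed and why, then move on to the next one
        print(f"Failed to create label schema {schema['name']}: {e}")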

Expected output:

```
Successfully retrieved dataset evaluation-example-app with UID XXXX.
Successfully created label schema Completeness with UID XXXX.
Successfully created label schema Correctness with UID XXXX.
Successfully created label schema Polite Tone with UID XXXX.
Successfully created label schema Retrieved Context Accuracy with UID XXXX.
Successfully created label schema unit_test_contains_pii with UID XXXX.
Successfully created label schema unit_test_mentions_competitors with UID XXXX.
```

In [None]:
# Create a dataset view for the label schemas
# This allows you to create annotation batches targeted to your criteria
from snorkelflow.client.dataset_views import create_dataset_view

cm_flipped = {v: k for k, v in cm.items()}
label_schema_uids = [label_schema.uid for label_schema in d.label_schemas]
sf.create_dataset_view(
    dataset=app_name,
    name="Native GenAI Viewer",
    view_type="single_llm_response_view",
    column_mapping=cm_flipped,
    label_schema_uids=label_schema_uids,
)

Expected output:

```
{'dataset_uid': XXXX,
 'name': 'Native GenAI Viewer',
 'view_type': 'single_llm_response_view',
 'column_mapping': {'instruction': 'question',
  'response': 'response',
  'prompt_prefix': 'prompt_prefix',
  'context': 'retrieved_context_clean'},
 'label_schema_uids': [YOUR_SCHEMA_UIDS],
 'dataset_view_uid': XXXX}
```
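
The dataset view expects the mapping in the opposite direction from the one used to create the app, which is why the cell above inverts `cm`. Inverting a dictionary silently drops entries if two keys share a value, so a guarded version can be useful. This is a minimal sketch; `invert_mapping` is a hypothetical helper.

```python
def invert_mapping(mapping: dict) -> dict:
    """Swap keys and values, refusing mappings whose values are not unique."""
    inverted = {value: key for key, value in mapping.items()}
    if len(inverted) != len(mapping):
        raise ValueError("Mapping values must be unique to invert without loss")
    return inverted


# Same column mapping as the one used to create the fine-tuning app
column_mapping = {
    "question": "instruction",
    "response": "response",
    "prompt_prefix": "prompt_prefix",
    "retrieved_context_clean": "context",
}
print(invert_mapping(column_mapping))
```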

The dataset view created above allows annotators to mark either `ACCEPT` or `REJECT` for the evaluation criteria.

To capture additional valuable feedback from annotators, you will want to create a free text feedback field too. Annotators can use this field to write gold standard responses, or to give a rationale for why they rejected a particular response.



In [None]:
# Define a name and description for the free text field for annotators
free_text_criteria = [
    {
        "name": "rationale_if_incorrect",
        "description": "Document the reasoning behind marking a response as incorrect and provide valuable feedback for improving the LLM's performance.",
    }
]

# Create the label schema for the rationale field
for schema in free_text_criteria:
    ctx.tdm_client.post(
        "label-schemas",
        json=dict(
            {
                "dataset_uid": ft_app.dataset_uid,
                "name": schema["name"],
                "description": schema["description"],
                "data_type": "text",
                "task_type": "classification",
                "is_multi_label": False,
                "label_map": {},
                "label_descriptions": {},
                "primary_text_field": "text",
                "is_text_label": True,
            }
        ),
    )

#### Create annotation batch for the first set of chatbot responses

Get the first generation of chatbot responses ready for evaluation.

Before creating the batch, create a reusable function that creates targeted annotation batches from different generations of your chatbot response datasets. These generations are called experiments.

In [None]:
# Define a helper function to get the source UID for an experiment
def get_experiment_uid(experiment_name):
    # Search the registered data sources for one whose source name matches the experiment
    res = None
    for v in ft_app.datasource_metadata.values():
        if v["source"]["source_name"] == experiment_name:
            # Keep the source UID of the matching data source
            res = v["source_uid"]
    return res


# Define a helper function to create a new annotation batch with the dataset from a particular experiment
def create_custom_batch(
    name,
    dataset_name,
    experiment_name,
    split,
    label_schema_names,
    batch_size=None,
    x_uids=None,  # None instead of a mutable default list
):
    from snorkelflow.sdk import Dataset

    ft_app = FineTuningApp.get(dataset_name)
    d = Dataset.get(dataset=dataset_name)
    # Keep only the label schemas requested for this batch
    schemas_in_batch = [ls for ls in d.label_schemas if ls.name in label_schema_names]
    experiment_uid = get_experiment_uid(experiment_name=experiment_name)
    if experiment_uid is None:
        return "No experiment with that name found, batch not created"
    # Combine any explicitly passed x_uids with every data point from the experiment's split
    experiment_x_uids = list(
        ft_app.get_dataframe(split=split, source_uids=[experiment_uid]).index
    )
    return d.create_batches(
        name=name,
        label_schemas=schemas_in_batch,
        batch_size=batch_size,
        x_uids=list(x_uids or []) + experiment_x_uids,
    )

In [None]:
# Create the annotation batch for the first set of responses, from the "train" split imported earlier
create_custom_batch(
    name="custom-sdk-batch",
    dataset_name=app_name,
    split="train",
    experiment_name=llama3_8B_v0,
    label_schema_names=["Completeness", "Correctness"],
    batch_size=100,
)

Expected output:

```
Successfully retrieved dataset evaluation-example-app with UID XXXX.
Successfully retrieved dataset evaluation-example-app with UID XXXX.
Successfully retrieved dataset evaluation-example-app with UID XXXX.
Successfully created 1 batches.

[<snorkelflow.sdk.batch.Batch at xxxxxxxxxxxxxx>]
```

## Define evaluators

Evaluators are functions that programmatically check whether an LLM response satisfies a criterion. They greatly accelerate evaluation, reducing or eliminating the need for manual annotation of each response against each criterion.

An evaluator can be anything that has the signature:

```
(prompt, response, [context]) → {0,1}
```

Examples include:

- An off-the-shelf classifier (e.g. for PII/toxicity)
- A prompted LLM, the key ingredient of an LLM-as-a-judge evaluator
- A heuristic rule
- An automated or heuristic comparison to an SME-annotated gold dataset (for example, a similarity score, calculated from an embedding or by an LLM, between the responses being evaluated and gold SME responses)
- A predictive model built with programmatic supervision

This notebook contains heuristic and LLM-as-judge evaluators.
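
As a minimal sketch of the `(prompt, response, [context]) → {0,1}` signature, the function below flags responses that appear to contain an email address or a US-style phone number. It is a hypothetical illustration of a heuristic rule, not an evaluator defined by Snorkel, and the two regular expressions are simplifying assumptions that would miss many kinds of PII.

```python
import re

# Simplified patterns, assumed for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def contains_pii(prompt: str, response: str, context: str = "") -> int:
    """Return 1 if the response appears to contain an email or phone number, else 0."""
    found = EMAIL_RE.search(response) or PHONE_RE.search(response)
    return int(bool(found))


print(contains_pii("How do I file a claim?", "Email jane.doe@example.com for help."))
print(contains_pii("How do I file a claim?", "You can file a claim in the member portal."))
```

The `prompt` and `context` arguments are unused here, but keeping them in the signature lets the same calling code run evaluators that do need them.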

#### Evaluator: Heuristic evaluator checks for mention of competitors

In [None]:
# Define a heuristic evaluator
# This evaluator assesses whether the chatbot's response mentions competitors
def mentions_competitors(df: pd.DataFrame) -> float:
    import pandas as pd

    competitor_list = [
        "Anthem",
        "UnitedHealth Group",
        "Cigna",
        "Aetna (CVS Health)",
        "Humana",
        "Centene Corporation",
        "Kaiser Permanente",
        "Blue Cross Blue Shield Association",
        "Molina Healthcare",
        "Health Care Service Corporation (HCSC)",
        "Highmark",
    ]
    # Evaluators are calculated on the fly slice-wise. Account for slices that don't have an x_uid for a given split.
    if len(df) == 0:
        raise ValueError("No samples found")
    df["response_has_competitors"] = df["response"].apply(
        lambda x: any(s in x for s in competitor_list)
    )
    return df["response_has_competitors"].mean()


# Test the function on a sample of data before registering it as a custom metric
t = ft_app.get_dataframe(split="train")
t_comp = mentions_competitors(t)
print(f"Fraction of sample data that mentions competitors: {t_comp}")

Expected output:

```
Fraction of sample data that mentions competitors: 0.06
```
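
The substring check in `mentions_competitors` is case-sensitive and requires the full list entry, so a response that says only "Aetna" does not match the entry "Aetna (CVS Health)". The sketch below shows one possible alternative that matches short aliases on word boundaries and ignores case. The alias list and helper names are assumptions for illustration. The looser matching also admits false positives, such as the ordinary word "anthem".

```python
import re

# Hypothetical short aliases derived from the competitor list above
COMPETITOR_ALIASES = [
    "Anthem", "UnitedHealth", "Cigna", "Aetna", "CVS Health", "Humana",
    "Centene", "Kaiser Permanente", "Blue Cross Blue Shield", "Molina",
    "HCSC", "Highmark",
]

# One case-insensitive pattern that matches any alias as a whole word or phrase
COMPETITOR_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(alias) for alias in COMPETITOR_ALIASES) + r")\b",
    re.IGNORECASE,
)


def response_mentions_competitor(response: str) -> bool:
    """Return True if any competitor alias appears in the response."""
    return COMPETITOR_RE.search(response) is not None


def competitor_mention_rate(responses) -> float:
    """Return the fraction of responses that mention at least one competitor."""
    responses = list(responses)
    if not responses:
        raise ValueError("No samples found")
    return sum(response_mentions_competitor(r) for r in responses) / len(responses)


print(competitor_mention_rate(["You could also try aetna.", "Your plan covers this."]))
```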

In [None]:
# Register the metric with the application
ft_app.register_custom_metric(
    metric_name="Heuristic | Mentions Competitors",
    metric_func=mentions_competitors,
    overwrite=True,
)

#### Evaluator: LLM-as-judge evaluator checks for response completeness

Use an LLM-as-judge to measure the Completeness criterion. Start by listing available models.

In [None]:
# Get a list of available foundation models to prompt
sf.get_external_model_endpoints()

Expected output:

```
{'openai/gpt-4o': 'https://api.openai.com/v1/chat/completions',
 'deepset/roberta-large-squad2': 'https://pxyrwkggvs7nr8x7.us-east-1.aws.endpoints.huggingface.cloud',
 'google/flan-t5-xl': 'https://edpowgnae37imiew.us-east-1.aws.endpoints.huggingface.cloud',
 'vertexai_lm/text-bison@001': 'https://cloud.google.com/vertex-ai',
 'openai/gpt-4o-mini': 'https://api.openai.com/v1/chat/completions',
 'vertexai_lm/chat-bison@001': 'https://cloud.google.com/vertex-ai',
 'vertexai_lm/gemini-1.0-pro': 'https://cloud.google.com/vertex-ai',
 'impira/layoutlm-document-qa': 'https://uep8go5hobns4u1w.us-east-1.aws.endpoints.huggingface.cloud',
 'openai/gpt-3.5-turbo': 'https://api.openai.com/v1/chat/completions',
 'openai/gpt-4': 'https://api.openai.com/v1/chat/completions',
 'mistral.mistral-7b-instruct-v0:2': 'http://Bedroc-Proxy-6ZRulff8IuuZ-1976839026.us-west-2.elb.amazonaws.com/api/v1',
 'meta.llama3-8b-instruct-v1:0': 'http://Bedroc-Proxy-6ZRulff8IuuZ-1976839026.us-west-2.elb.amazonaws.com/api/v1',
 'azure_openai/jioh-gpt-4o-mini': 'https://jiohopenaiinstance.openai.azure.com/chat/completions'}
```

Engineer a prompt that enables an LLM to act as a judge in assessing whether the health insurance chatbot's answers pass or fail the Completeness criterion.

After defining the prompt, it's a best practice to run it on a subset of the data points before running it over the entire dataset.

In [None]:
# Define an LLMAJ evaluator with a custom prompt and model
from snorkelflow.evaluation.metric_schema import CustomPromptMetricSchema

pt = """As an AI judge evaluating a healthcare copilot, your task is to assess the Completeness of the chatbot's responses. Focus on how well the chatbot clarifies complex medical concepts, provides logical explanations, and uses relevant examples or analogies. Consider the following:

1. Ease of comprehension: Is the explanation easy to understand for a general audience?
2. Logical flow: Does the response progress in a coherent, step-by-step manner?
3. Use of examples/analogies: Are appropriate comparisons made to aid comprehension?
4. Jargon management: Is medical terminology adequately explained when used?
5. Completeness: Does the explanation cover all necessary aspects of the topic?

If the response is very complete according to the items outlined above, give it a 1.
If the response is not complete according to the items outlined above, give it a 0.

Only respond with one number, either 0 or 1, and NO OTHER NUMBERS OR TEXT.
    Response: {response}
"""

LLMAJ_completeness = CustomPromptMetricSchema(
    metric_type="custom_prompt",
    display_name="LLMAJ | Completeness",
    description="LLMAJ for Completeness Criteria",
    display_style="percentage",
    prompt_text=pt,
    model_name=model_name,
)

In [None]:
from snorkelflow.client_v3.evaluation import preview_custom_prompt_metric

# Test the LLMAJ prompt for a sample of the chatbot responses
a = list(ft_app.get_dataframe().index)[:5]
res = preview_custom_prompt_metric(
    dataset=ft_app.dataset_uid,
    x_uids=a,
    metric_schema=LLMAJ_completeness,
)
res[["question", "response", "perplexity", "score"]]

It looks like our LLM-as-judge is generating the values that we'd expect!
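
The prompt asks the judge to reply with only `0` or `1`, but a model may occasionally add extra text. If you call a judge model yourself, outside of Snorkel's metric schema, a strict parser makes such cases visible instead of silently miscounting them. This is a minimal sketch; `parse_judge_score` is a hypothetical helper and does not describe how Snorkel Flow parses judge output internally.

```python
from typing import Optional


def parse_judge_score(raw_output: str) -> Optional[int]:
    """Return 0 or 1 if the judge followed the output format, otherwise None."""
    cleaned = raw_output.strip()
    if cleaned in {"0", "1"}:
        return int(cleaned)
    return None


# Example raw outputs: two well-formed, two that violate the requested format
outputs = ["1", " 0\n", "Score: 1", ""]
print([parse_judge_score(o) for o in outputs])
```

Counting the `None` results over a sample gives a rough signal of how reliably the judge follows the prompt's format instruction.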

#### Register the LLMAJ evaluator

Register the LLMAJ completeness evaluator in Snorkel Flow so its results are available to the application's evaluation dashboard.

In [None]:
# Register the metric with the application
ft_app.register_metric(LLMAJ_completeness)

## Define reference prompts

Define the set of prompts that you want to use across different generations of evaluation.

As you fine-tune the model powering your chatbot, the responses will change, but these prompts will stay the same so you have a consistent reference point for the evaluation scores. You may add to this set of prompts if you want to evaluate your model on a wider array of inputs.

In [None]:
from snorkelflow.sdk import ModelNode

node = ModelNode.get(node_uid=ft_app.model_node_uid)
df = node.get_dataframe()
df["question"].values[:5]

Expected output:

```
array(['What is the procedure for updating personal information and dependents for coverage?',
       'What are the specific age limits for dependent coverage under my plan?',
       "How to challenge partial denial of a dependent's medical claim?",
       'Can you explain the process of prior authorization for specific treatments or medications?',
       'How is coverage determined for experimental or investigational treatments?'],
      dtype=object)
```
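
Because the reference prompts are meant to stay fixed across generations, a quick comparison of the question sets from two experiments can catch accidental drift. The sketch below uses plain Python sets; `compare_prompt_sets` is a hypothetical helper, and the sample questions are shortened for illustration.

```python
def compare_prompt_sets(reference_prompts, new_prompts):
    """Report prompts that were dropped from or added to the reference set."""
    reference, new = set(reference_prompts), set(new_prompts)
    return {
        "missing": sorted(reference - new),
        "added": sorted(new - reference),
    }


reference = ["What are the age limits for dependents?", "How is coverage determined?"]
candidate = ["How is coverage determined?", "What is a deductible?"]
print(compare_prompt_sets(reference, candidate))
```

An empty `missing` list indicates that every reference prompt is still covered; entries in `added` are acceptable if you deliberately widened the set.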

Next, let's retrieve some entries from the sample dataset so we can examine the data in more detail. The dataset has several fields:

- **question**: The instruction passed from the user to the alpha version of the GenAI chatbot application.
- **response**: The response from the healthcare chatbot.
- **preference**: Some pre-collected ground truth measuring overall response quality.
- **metadata_label**: The output of an intent classifier that was trained offline. Its values are used during evaluation to help define data slices.
- **retrieved_context_clean**: The chatbot uses a retrieval augmented generation (RAG) pipeline to aid in factually grounding responses in the organization's knowledge base. To render this field properly for annotators, ensure that each cell in this column contains valid JSON.
- **prompt_prefix**: This is the static portion of the prompt that's passed to the LLM with each user interaction. This is also referred to as the system prompt.
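
Since the `retrieved_context_clean` column must hold valid JSON to render properly, it can be worth checking the cells before import. The sketch below uses the standard `json` module; `find_invalid_json` is a hypothetical helper, and the sample cells are made up.

```python
import json


def find_invalid_json(cells):
    """Return (index, error message) pairs for cells that are not valid JSON."""
    problems = []
    for index, cell in enumerate(cells):
        try:
            json.loads(cell)
        except (TypeError, json.JSONDecodeError) as exc:
            # TypeError covers missing values such as None or NaN
            problems.append((index, str(exc)))
    return problems


cells = [
    '[{"title": "Plan A", "text": "Covers annual checkups"}]',  # valid JSON
    "{'single': 'quotes'}",  # Python-style quotes are not valid JSON
    None,  # missing value
]
for index, message in find_invalid_json(cells):
    print(index, message)
```

With a pandas DataFrame, the same function can be called as `find_invalid_json(df["retrieved_context_clean"])`, in which case the reported indexes are positions rather than index labels.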

In [None]:
# Retrieve a few data points
df.head()

Expected output:

Jupyter should render a table with some entries from the sample data.

## Define data slices

Data slices sort and codify inputs with different characteristics. These should relate to categories of users or categories of questions that matter to your business.

In this healthcare example, you will define slices for:

- Verbose questions
- Questions written in Spanish
- Disputes
- Questions about out-of-network coverage
- Questions about premiums

In [None]:
# Imports
from tqdm import tqdm

tqdm.pandas()
from snorkelflow.sdk.slices import Slice, SliceConfig
from snorkelflow.sdk import ModelNode
from templates import RegexTemplateSchema, KeywordTemplateSchema
from snorkelflow.utils.graph import DEFAULT_GRAPH

node = ModelNode.get(node_uid=ft_app.model_node_uid)


# Define a helper that applies a slicing function to every row and returns the matching rows and their x_uids
def apply_slice(slicing_fn):
    df = node.get_dataframe()
    slice_mask = df.progress_apply(slicing_fn, axis=1)
    df_sliced = df[slice_mask]
    x_uids = list(df_sliced.index)
    slice_name = slicing_fn.__name__
    slice_percent = len(df_sliced) / len(df)
    print(
        f"Applied '{slice_name}' slice (n={len(df_sliced)}, {slice_percent*100:.2f}%)"
    )
    return df_sliced, x_uids

In [None]:
# Create a slice of data for verbose questions
def verbose_question(x):
    return len(x.question) > 100


verbose_df, verbose_x_uids = apply_slice(verbose_question)
verbose_slice = Slice.create(name="verbose_question", dataset=ft_app.dataset_uid)
verbose_slice.add_x_uids(verbose_x_uids)

Expected output:

```
Applied 'verbose_question' slice (n=9, 4.50%)
Successfully created slice verbose_question with UID XXXX.
Successfully added 9 datapoints to slice verbose_question.
```

In [None]:
# Create a slice of data for questions in Spanish
!pip install fasttext-wheel
import fasttext

model = fasttext.load_model("./eval-lid.176.bin")


def is_spanish(x):
    predictions = model.predict(x.question)
    lang = predictions[0][0].split("__label__")[-1]
    return lang == "es"


spanish_df, spanish_x_uids = apply_slice(is_spanish)
spanish_slice = Slice.create(name="spanish", dataset=ft_app.dataset_uid)
spanish_slice.add_x_uids(spanish_x_uids)

Expected output:

```
Defaulting to user installation because normal site-packages is not writeable
Collecting fasttext-wheel
  Downloading fasttext_wheel-0.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)

...

Applied 'is_spanish' slice (n=0, 0.00%)
Successfully created slice spanish with UID XXXX.
Successfully added 0 datapoints to slice spanish.
```
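
The `is_spanish` function relies on the shape of a fastText prediction: a tuple of labels such as `('__label__es',)` paired with an array of scores. The sketch below isolates the label parsing so it can be checked without loading the model. It also normalizes whitespace, because fastText's `predict` raises an error when a single input string contains a newline. Both helper names are hypothetical.

```python
def parse_language_label(prediction) -> str:
    """Extract the language code from a fastText-style (labels, scores) prediction."""
    labels, _scores = prediction
    return labels[0].split("__label__")[-1]


def prepare_text(text: str) -> str:
    """Collapse all whitespace runs, including newlines, into single spaces."""
    return " ".join(text.split())


# A stand-in for model.predict(...) output, so the example runs without the model file
fake_prediction = (("__label__es",), [0.98])
print(parse_language_label(fake_prediction))
print(prepare_text("¿Cuál es mi deducible?\n¿Y mi copago?"))
```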

In [None]:
# Create a slice of data for disputes
topic_disputes_slice = Slice.create(
    name="topic_disputes",
    dataset=ft_app.dataset_uid,
    config=SliceConfig(
        templates=[
            RegexTemplateSchema(
                field="question",
                regex_pattern=r"\b(appeal|appealed|dispute|disputed|disputes)\b",
                case_sensitive=False,
            )
        ],
        graph=DEFAULT_GRAPH,
    ),
)

# Create a slice of data for questions about out of network topics
topic_out_of_network_slice = Slice.create(
    name="topic_out_of_network",
    dataset=ft_app.dataset_uid,
    config=SliceConfig(
        templates=[
            RegexTemplateSchema(
                field="question",
                regex_pattern=r"\b(network)\b",
                case_sensitive=False,
            )
        ],
        graph=DEFAULT_GRAPH,
    ),
)

# Create a slice of data for questions about premiums
topic_premiums_slice = Slice.create(
    name="topic_premiums",
    dataset=ft_app.dataset_uid,
    config=SliceConfig(
        templates=[
            RegexTemplateSchema(
                field="question",
                regex_pattern=r"\b(PREMIUMS)\b",
                case_sensitive=False,
            )
        ],
        graph=DEFAULT_GRAPH,
    ),
)

Expected output:

```
Successfully created slice topic_disputes with UID XXXX.
Successfully created slice topic_out_of_network with UID XXXX.
Successfully created slice topic_premiums with UID XXXX.
```
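
Before creating regex-based slices, it can help to test the patterns locally against a few sample questions. The sketch below uses Python's `re` module with case-insensitive matching, on the assumption that `RegexTemplateSchema` follows similar semantics. The sample questions are made up. Because of the word boundaries, the singular "premium" and the plural "appeals" do not match the patterns as written.

```python
import re

# Patterns copied from the slice definitions above
SLICE_PATTERNS = {
    "topic_disputes": r"\b(appeal|appealed|dispute|disputed|disputes)\b",
    "topic_out_of_network": r"\b(network)\b",
    "topic_premiums": r"\b(PREMIUMS)\b",
}


def matching_slices(question: str) -> list:
    """Return the names of all slices whose pattern matches the question."""
    return [
        name
        for name, pattern in SLICE_PATTERNS.items()
        if re.search(pattern, question, flags=re.IGNORECASE)
    ]


samples = [
    "How do I appeal a denied claim?",
    "Is this provider out-of-network?",
    "Are premiums refundable?",
    "Why did my premium go up?",
]
for question in samples:
    print(question, "->", matching_slices(question))
```

If a slice should also capture singular or plural variants, extend the alternation in its pattern, for example `premium|premiums`.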

In [None]:
# Define a function to reapply slices
def reapply_slices(dataset_uid):
    # Manual
    verbose_df, verbose_x_uids = apply_slice(verbose_question)
    verbose_slice = Slice.get(dataset=dataset_uid, slice="verbose_question")
    verbose_slice.add_x_uids(verbose_x_uids)

    spanish_df, spanish_x_uids = apply_slice(is_spanish)
    spanish_slice = Slice.get(dataset=dataset_uid, slice="spanish")
    spanish_slice.add_x_uids(spanish_x_uids)

    # Programmatic: collect the UIDs of every slice, then ask Snorkel Flow to reapply them
    slices = Slice.list(dataset=dataset_uid)
    slice_uids = [sl.slice_uid for sl in slices]
    ctx.tdm_client.post(
        f"/dataset/{dataset_uid}/apply",
        json={"dataset_uid": dataset_uid, "slice_uids": slice_uids},
    )
    print(f"Successfully reapplied slices to the dataset {dataset_uid}")

# Phase 2: Creating the initial evaluation benchmark

After onboarding all the artifacts, it's time to run the initial benchmark and view the results in the evaluation dashboard in Snorkel Flow.

There are two ways to create evaluation reports: 1) from the fine-tuning application, and 2) as part of the standalone `evaluation` module. For simple evaluation reports, Snorkel recommends using Method 1, but for more complex, multi-criteria, multi-evaluator reports, Snorkel recommends Method 2.

Note: Any evaluation report that includes an LLM-based metric will take longer to compute because of the API calls to the model. The first run will take longer and subsequent runs will be quicker thanks to our built-in caching.

## Create a simple evaluation report

Use this for simpler evaluation reports.

In [None]:
# Create a simple, initial evaluation report
ft_app.create_evaluation_report()

Expected output:

```
No quality models found, skipping model acceptance rate metric
Successfully retrieved dataset evaluation-example-app with UID XXXX.
+0.58s Slice apply job for dataset XXXX
+0.59s Starting evaluation module compute metrics job
+1.28s Found and loaded datasources for models llama3-8B-v0 in dataset XXXX train split.
+1.97s No valid data found to calculate the metric Heuristic | Mentions Competitors.

...

{'dataset_uid': XXXX,

...

     'global': 100.0}}},
  'test': {}}}
```

### Define evaluation report helpers

In [None]:
# Create a helper function to get the latest report UID
def get_latest_report_uid(dataset_uid):
    reports = ctx.tdm_client.get(
        f"/dataset/{dataset_uid}/evaluation-report",
    )
    return reports[0].get("evaluation_report_uid")

In [None]:
# Create a helper function to update the report description
def update_report_description(dataset_uid, eval_report_uid, description):
    ctx.tdm_client.put(
        f"/dataset/{dataset_uid}/evaluation-report/{eval_report_uid}",
        json={
            "dataset_uid": dataset_uid,
            "evaluation_report_uid": eval_report_uid,
            "additional_notes": description,
        },
    )
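
`get_latest_report_uid` indexes `reports[0]`, so it raises an `IndexError` if no report exists yet. A defensive variant might look like the sketch below. It assumes, as the helper above does, that the endpoint returns a list of dicts with the newest report first; `pick_latest_report_uid` is a hypothetical name, not part of the Snorkel SDK.

```python
from typing import Any, Optional


def pick_latest_report_uid(reports: list[dict[str, Any]]) -> Optional[int]:
    """Return the UID of the first report in the list, or None if there are none."""
    if not reports:
        return None
    return reports[0].get("evaluation_report_uid")


# Sample payloads shaped like the helper above expects (values are made up)
latest = pick_latest_report_uid(
    [{"evaluation_report_uid": 7}, {"evaluation_report_uid": 3}]
)
nothing_yet = pick_latest_report_uid([])
print(latest, nothing_yet)
```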

## Create an evaluation report using multiple criteria (ground truth only)

Create a more complex evaluation report, starting with the human-annotated ground truth.

In [None]:
# Create a list of metric schemas to capture the ground truth inputs. Later, you will append a custom metric schema
from snorkelflow.client.utils import get_dataset_uid
from snorkelflow.client_v3 import evaluation
from snorkelflow.client_v3.evaluation import MetricSchema
from snorkelflow.evaluation.metric_schema import GTAcceptanceRateMetricSchema


# Use the previously defined criteria. For each criterion, collect the label_schema_uid
def get_label_schema_uids(dataset, criteria):
    schema_uids = []
    for label_schema in dataset.label_schemas:
        if label_schema.name in criteria:
            schema_uids.append(label_schema.uid)
    return schema_uids


dataset_uid = int(sf.get_dataset_uid(app_name))
eval_dataset = sf_sdk.Dataset.get(app_name)


criteria_names = [c["name"] for c in criteria]
criteria_uids = get_label_schema_uids(dataset=eval_dataset, criteria=criteria_names)

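# Define criteria based on ground truth. Look up each display name by label schema UID
# so the name always matches the schema, regardless of the order of `criteria`
label_schema_names = {ls.uid: ls.name for ls in eval_dataset.label_schemas}
ground_truth_metric_schemas = [
    GTAcceptanceRateMetricSchema(
        display_name="GT | " + label_schema_names[label_uid],
        dataset_uid=dataset_uid,
        label_schema_uid=label_uid,
        accept_label_value=1,
    )
    for label_uid in criteria_uids
]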
for gt_metric_schema in ground_truth_metric_schemas:
    ft_app.register_metric(gt_metric_schema)

Expected output:

```
Successfully retrieved dataset evaluation-example-app with UID XXXX.
```
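
The cell above matches criteria to label schemas by name, so a criterion with no matching label schema is skipped without any warning. The sketch below shows one way to surface that. `SimpleNamespace` objects stand in for Snorkel label schema objects (only the `name` and `uid` attributes used above are assumed), and `find_missing_criteria` is a hypothetical helper.

```python
from types import SimpleNamespace


def find_missing_criteria(label_schemas, criteria_names):
    """Return the criteria names that have no label schema with the same name."""
    schema_names = {schema.name for schema in label_schemas}
    return [name for name in criteria_names if name not in schema_names]


# Stand-ins for dataset.label_schemas; the UIDs are made up
label_schemas = [
    SimpleNamespace(name="Completeness", uid=11),
    SimpleNamespace(name="Polite Tone", uid=12),
]
criteria_names = ["Completeness", "Polite Tone", "Correctness"]

missing = find_missing_criteria(label_schemas, criteria_names)
print(missing)
```

An empty list means every criterion will get a ground truth metric schema.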

In [None]:
# Create an evaluation report based on the ground truth
ft_app.create_evaluation_report()

Expected output:

```
+0.64s No valid data found to calculate the metric GT | Polite Tone.
+1.33s Found and loaded datasources for models  in dataset XXXX test split.
+2.02s (100%) Metric compute completed. Evaluation report created with UID XXXX.

{'dataset_uid': XXXX,

...

     'global': 100.0}}},
  'test': {}}}
```

In [None]:
# Add a description to the latest report
eval_report_uid = get_latest_report_uid(dataset_uid)
update_report_description(dataset_uid, eval_report_uid, "Baseline")

## Create a complex evaluation report using ground truth and evaluator criteria

To create a more advanced version of the evaluation report, use Snorkel Flow's `evaluation` module. The `evaluation` module allows you to combine all of the ground truth collected at the dataset level with all of your custom evaluators.

To combine the ground-truth-based measurements with custom evaluators, serialize the evaluators defined above and pass them to the `CustomMetricSchema` class. Finally, create a new evaluation report with the ground truth metric schemas and the evaluator metric schemas.

In [None]:
# Add an LLM-based custom evaluator to the ground truth multi-criteria report
import inspect
from snorkelflow.evaluation.metric_schema import CustomMetricSchema
from snorkelflow.serialization.code_asset import serialize_asset

evaluator_metric_schemas = [LLMAJ_completeness]

mentions_competitors_custom_metric_schema = CustomMetricSchema(
    display_name="Heuristic | Mentions Competitors",
    description="If the response mentions competitors",
    serialized_custom_metric_func=serialize_asset(mentions_competitors),
    raw_code=inspect.getsource(mentions_competitors),
)
evaluator_metric_schemas.append(mentions_competitors_custom_metric_schema)

for evaluator_metric_schema in evaluator_metric_schemas:
    ft_app.register_metric(evaluator_metric_schema)

In [None]:
# Create the initial benchmark report
ft_app.create_evaluation_report()

Expected output:

```
+0.59s Found and loaded datasources for models llama3-8B-v0 in dataset XXXX train split.
+4.02s No valid data found to calculate the metric LLMAJ | Completeness.

...

{'dataset_uid': XXXX,

...

     'global': 100.0}}},
  'test': {}}}
```

Now you have an initial benchmark for your chatbot's response performance. You can view these initial data points in Snorkel Flow. Select your application and then select **Evaluate** to view the initial benchmark data points.

# Phase 3: Refining the benchmark

## General guidelines

It's important to assess the accuracy and relevance of your benchmark, or your future evaluation metrics will be measured against an inaccurate standard. This section provides guidance and practical examples for refining your benchmark.

Ideally, each evaluator is validated as trustworthy in the early phases of an experiment, so it can be used to expedite developing and measuring the GenAI application.

You can create multiple reports for a single experiment, or generation, of your chatbot data. It's quite likely that you will create multiple reports for the first experiment as you refine the benchmark into one that you trust.

#### Collecting ground truth labels, using SME annotation

After running the first round of evaluators, Snorkel recommends collecting a small number of ground truth labels for each criterion to ensure the programmatic evaluators give scores similar to human scores.
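
One way to check whether a programmatic evaluator tracks human scores is to compute agreement on the datapoints that have both labels. The sketch below is a minimal, library-free example for binary accept (1) and reject (0) labels. The label lists are made up, and the two helper functions are hypothetical, not part of the Snorkel SDK.

```python
def agreement_rate(human, evaluator):
    """Fraction of datapoints where the evaluator matches the human label."""
    if len(human) != len(evaluator) or not human:
        raise ValueError("Label lists must be non-empty and the same length")
    matches = sum(h == e for h, e in zip(human, evaluator))
    return matches / len(human)


def cohens_kappa(human, evaluator):
    """Agreement corrected for chance, for binary 0/1 labels."""
    n = len(human)
    observed = agreement_rate(human, evaluator)
    p_human = sum(human) / n
    p_eval = sum(evaluator) / n
    # Probability that both raters agree by chance alone
    expected = p_human * p_eval + (1 - p_human) * (1 - p_eval)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for eight datapoints
human_labels = [1, 1, 0, 1, 0, 1, 1, 0]
evaluator_labels = [1, 1, 0, 0, 0, 1, 1, 1]

rate = agreement_rate(human_labels, evaluator_labels)
kappa = cohens_kappa(human_labels, evaluator_labels)
print(rate, round(kappa, 3))
```

Raw agreement can look high when almost every response is accepted, which is why a chance-corrected statistic is shown alongside it.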

If you already collected ground truth labels before registering the evaluators, it's safe to skip this step. Some evaluators may not need it at all: for example, an enterprise-specific, pre-trained PII/PHI model may not require SME annotation for the use case.

Snorkel recommends re-engaging domain experts throughout development for high-leverage, ambiguous classes of errors, as well as in the final rounds of development as a pipeline approaches production.

#### Improving evaluators

Certain criteria may be too difficult for a single evaluator. For example, an organization's definition of "Correctness" may be so broad that developers find that a single evaluator does not accurately scale SME preferences. In cases like this, Snorkel recommends one of the following:

- Break down the criterion into more fine-grained definitions that can each be measured by a single evaluator.
- Rely on high-quality annotations for that criterion during development.
- Collect gold standard responses and create a custom evaluator to measure similarity to the collected gold standard response.
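
The last option can be sketched as a function that scores a response by its overlap with the gold standard response. The example below uses token-set Jaccard similarity, which is a deliberately simple lexical measure chosen for illustration; an embedding-based similarity may work better in practice. The function names, the `threshold` value, and the sample texts are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def gold_similarity(response, gold_response):
    """Jaccard overlap between the token sets of a response and its gold answer."""
    response_tokens = set(tokenize(response))
    gold_tokens = set(tokenize(gold_response))
    if not response_tokens and not gold_tokens:
        return 1.0
    return len(response_tokens & gold_tokens) / len(response_tokens | gold_tokens)


def matches_gold(response, gold_response, threshold=0.5):
    """Hypothetical accept rule: 1 if similarity meets the threshold, else 0."""
    return int(gold_similarity(response, gold_response) >= threshold)


# Made-up gold answer and two candidate responses
gold = "Your plan covers two dental cleanings per year."
good = "Your plan covers two dental cleanings each year."
bad = "Please contact support."

print(matches_gold(good, gold), matches_gold(bad, gold))
```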

### More best practices for refining the benchmark

- **If most of your data isn't captured by data slices**: Consider refining or writing new slicing functions.
- **If a high-priority data slice is under-represented in your dataset**: Consider using Snorkel's synthetic data generation modules (SDK) to augment your existing dataset. Also consider retrieving a more diverse instruction set from an existing query stream or knowledge base.
- **If an evaluator is inaccurate**: Use the data explorer to identify key failure modes with the evaluator, and create a batch of these inaccurate predictions for an annotator to review. Once ground truth has been collected, you can scale out these measurements via a fine-tuned quality model or include these as few-shot examples in a prompt-engineered LLM-as-judge.
- **To scale a criterion currently measured via ground truth**: From the data explorer dialog inside the evaluation dashboard, select **Go to Studio**. Use the criterion's ground truth and Snorkel's Studio interface to write labeling functions, train a specialized model for that criterion, and register it as a custom evaluator. These fine-tuned quality models can also be used elsewhere in the platform for LLM Fine-tuning and RAG tuning workflows.
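
To judge whether slices capture most of your data, or whether a high-priority slice is under-represented, you can compute coverage from slice membership. The sketch below works on plain Python collections. How you retrieve slice membership from Snorkel Flow is not shown, and `slice_coverage` is a hypothetical helper with made-up data.

```python
def slice_coverage(all_x_uids, slice_membership):
    """Return (fraction covered by any slice, per-slice share of the dataset)."""
    all_x_uids = set(all_x_uids)
    covered = set()
    shares = {}
    for slice_name, members in slice_membership.items():
        # Ignore members that are not part of this dataset
        members = set(members) & all_x_uids
        covered |= members
        shares[slice_name] = len(members) / len(all_x_uids)
    return len(covered) / len(all_x_uids), shares


# Made-up datapoint IDs and slice assignments
all_x_uids = range(20)
slice_membership = {
    "verbose_question": [0, 1, 2, 3],
    "spanish": [3, 4],
}

coverage, shares = slice_coverage(all_x_uids, slice_membership)
print(coverage, shares)
```

Here only a quarter of the datapoints fall into any slice, which would suggest writing additional slicing functions.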

## Example: Add or refine reference prompts

The initial training dataset lacked significant examples of Spanish-language chatbot prompts and responses. You can add more prompts by uploading another datasource.

In [None]:
# Upload existing prompts, translated into Spanish
import pandas as pd

spanish_data = pd.read_csv("./eval-spanish.csv")
spanish_questions = spanish_data.question.tolist()

In [None]:
# Use Snorkel's synthetic data generation module to augment the Spanish dataset
augmented_questions = sf.augment_data(
    data=spanish_questions,
    model_name=model_name,
    runs_per_prompt=2,
)
augmented_questions

Expected output:

```
+0.59s (100%) Running inference.
```

You should see a Jupyter-generated table of Spanish chatbot questions and responses about health insurance.

In [None]:
# Add the Spanish data to the existing dataset, under the first experiment
ft_app.import_data(data=spanish_data, split="train", source_uid=experiment1, sync=True)

Expected output:

```
Successfully retrieved dataset evaluation-example-app with UID XXXX.
+0.59s Starting fine tuning data ingestion
+1.98s (100%) Data ingestion complete
Data ingestion complete
+0.59s Creating active datasource for op node
+1.29s (1%) Processing data source 1 / 1: starting.
+11.68s (1%) Processing data source 1 / 1: starting.
+22.08s (1%) Processing data source 1 / 1: starting.
+32.45s (1%) Processing data source 1 / 1: starting.
+34.53s (1%) Processing labels.
+35.22s (100%) Transactions committed.

'rq-DsLAvst2_engine-tr9a_prep-and-ingest-fine-tuning-data'
```

In [None]:
# Use the reapply slices helper function created earlier
reapply_slices(ft_app.dataset_uid)

Expected output:

```
Applied 'verbose_question' slice (n=11, 5.26%)
Successfully added 11 datapoints to slice verbose_question.
Applied 'is_spanish' slice (n=9, 4.31%)
Successfully added 9 datapoints to slice spanish.
Retrieving slices for dataset XXXX.
Successfully reapplied slices to the dataset XXXX
```

In [None]:
# Create a new evaluation report for the first experiment with the additional data
ft_app.create_evaluation_report()

Expected output:

```
+0.59s Found and loaded datasources for models llama3-8B-v0 in dataset XXXX train split.
+11.01s Found and loaded datasources for models llama3-8B-v0 in dataset XXXX train split.

...

{'dataset_uid': XXXX,

...

     'global': 100.0}}},
  'test': {}}}
```

In [None]:
# Add a description to the latest report
eval_report_uid = get_latest_report_uid(dataset_uid)
update_report_description(dataset_uid, eval_report_uid, "Add Spanish data")

## Example: Add ground truth labels

This section walks through creating a targeted batch of data to send to annotators for human feedback. In this case, you create a batch for annotators to label on the "Completeness" criterion, so you can use their labels to assess the accuracy of the LLMAJ evaluator that scores the same criterion.

In [None]:
# Collect ground truth for the Completeness criterion to evaluate the LLMAJ | Completeness evaluator
# To include specific datapoints in the batch, add their UIDs here
specific_x_uids = []
create_custom_batch(
    name="completeness-batch",
    dataset_name=app_name,
    split="train",
    experiment_name=llama3_8B_v0,
    label_schema_names=["Completeness"],
    batch_size=100,
    x_uids=specific_x_uids,
)

Expected output:

```
Successfully retrieved dataset evaluation-example-app with UID XXXX.
Successfully retrieved dataset evaluation-example-app with UID XXXX.
Successfully retrieved dataset evaluation-example-app with UID XXXX.
Successfully created 1 batches.
[<snorkelflow.sdk.batch.Batch at 0x7f1150723550>]
```
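
If you want to fill `specific_x_uids` instead of leaving it empty, one option is to prioritize datapoints that the evaluator rejected, since those are often the most informative to review. The sketch below assumes you already have evaluator scores as a list of dicts; the record layout, the `x_uid` values, and `select_for_review` are all hypothetical. A sample chosen this way is biased toward failures, so it is better suited to finding failure modes than to estimating overall evaluator accuracy.

```python
def select_for_review(records, max_items):
    """Pick x_uids to annotate: evaluator rejections first, then acceptances."""
    # sorted() is stable, so datapoints keep their original order within each group
    ranked = sorted(records, key=lambda r: r["llmaj_completeness"])
    return [r["x_uid"] for r in ranked[:max_items]]


# Made-up evaluator output: 1 = accepted, 0 = rejected
records = [
    {"x_uid": "doc::0", "llmaj_completeness": 1},
    {"x_uid": "doc::1", "llmaj_completeness": 0},
    {"x_uid": "doc::2", "llmaj_completeness": 1},
    {"x_uid": "doc::3", "llmaj_completeness": 0},
]

review_x_uids = select_for_review(records, max_items=3)
print(review_x_uids)
```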

#### External input required: Annotate the batch

At this point, you would engage your annotators to label the new batch of data.

Snorkel's `evaluation` module combines the ground truth human metric with the LLMAJ metric.

In [None]:
# Create a new evaluation report for the first experiment with the updated metric for Completeness
ft_app.create_evaluation_report()

Expected output:

```
+0.59s Found and loaded datasources for models llama3-8B-v0 in dataset XXXX train split.
+11.15s Found and loaded datasources for models llama3-8B-v0 in dataset XXXX train split.

...

{'dataset_uid': XXXX,

...

     'global': 100.0}}},
  'test': {}}}
```

In [None]:
# Add a description to the latest report
eval_report_uid = get_latest_report_uid(dataset_uid)
update_report_description(dataset_uid, eval_report_uid, "Add completeness GT")

# Phase 4: Improving your GenAI application

Now that you have a trustworthy benchmark, you can use a variety of techniques to improve your GenAI application.

The key task at this stage is to identify classes of errors and fix them. You can do this with a variety of techniques, including:

- **LLM fine tuning**: Fine tuning allows you to change the LLM's parameters to adapt its performance to your criteria. You can use Snorkel Flow to programmatically curate a high-quality, diverse training dataset that’s passed to an LLM for fine tuning. Generated responses are brought back to Snorkel Flow for response quality labeling, error analysis, and iterative development. Amazon SageMaker is one option for fine tuning, demonstrated below.
- **RAG tuning**: Improve the chunking, embedding, or metadata in your vector database. On request, Snorkel can provide an example notebook with instructions for using Snorkel to tune a RAG system.
- **Prompt development**: Snorkel Flow's prompt development features will be released in early 2025.
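
Before choosing a technique, it helps to see which classes of errors are most common. The sketch below counts rejected responses per slice and criterion from a flat list of labeled records. The record layout and `rank_error_classes` are assumptions for illustration; in practice you would export this information from your evaluation results.

```python
from collections import Counter


def rank_error_classes(records):
    """Count rejected responses per (slice, criterion) pair, most frequent first."""
    failures = Counter(
        (r["slice"], r["criterion"]) for r in records if r["label"] == 0
    )
    return failures.most_common()


# Made-up labels: 1 = accepted, 0 = rejected
records = [
    {"slice": "spanish", "criterion": "Completeness", "label": 0},
    {"slice": "spanish", "criterion": "Completeness", "label": 0},
    {"slice": "spanish", "criterion": "Polite Tone", "label": 1},
    {"slice": "verbose_question", "criterion": "Completeness", "label": 0},
    {"slice": "verbose_question", "criterion": "Polite Tone", "label": 1},
]

ranked = rank_error_classes(records)
print(ranked)
```

In this made-up data, incomplete answers to Spanish-language questions are the largest error class, which would point toward fine tuning on good Spanish examples.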

### Fine tune the model powering your GenAI application using Amazon SageMaker

This section relies on access and authentication secrets for Amazon SageMaker, which were defined in the [user input section](#User-input) at the top of the notebook.

In [None]:
# Imports
import boto3
from sagemaker import Session
from snorkelflow.sdk.fine_tuning_app import ExternalModelTrainer
from snorkelflow.types.finetuning import FinetuningProvider

In [None]:
# Establish a connection to SageMaker
AWS_ACCESS_KEY_ID = "aws::finetuning::access_key_id"
AWS_SECRET_ACCESS_KEY = "aws::finetuning::secret_access_key"
SAGEMAKER_EXECUTION_ROLE = "aws::finetuning::sagemaker_execution_role"
FINETUNING_AWS_REGION = "aws::finetuning::region"
sf.set_secret(
    AWS_ACCESS_KEY_ID,
    AWS_ACCESS_KEY_ID_SECRET,
    secret_store="local_store",
    workspace_uid=workspace_uid,
    kwargs=None,
)
sf.set_secret(
    AWS_SECRET_ACCESS_KEY,
    AWS_SECRET_ACCESS_KEY_SECRET,
    secret_store="local_store",
    workspace_uid=workspace_uid,
    kwargs=None,
)
sf.set_secret(
    SAGEMAKER_EXECUTION_ROLE,
    SAGEMAKER_EXECUTION_ROLE_SECRET,
    secret_store="local_store",
    workspace_uid=workspace_uid,
    kwargs=None,
)
sf.set_secret(
    FINETUNING_AWS_REGION,
    "us-west-2",
    secret_store="local_store",
    workspace_uid=workspace_uid,
    kwargs=None,
)
boto_session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID_SECRET,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY_SECRET,
    region_name="us-west-2",
)
sagemaker_client = boto_session.client("sagemaker")
sagemaker_runtime_client = boto_session.client("sagemaker-runtime")
sagemaker_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_runtime_client=sagemaker_runtime_client,
)

# Configurations for the fine-tuning job
finetuning_configs = {
    "epoch": "1",
    "instruction_tuned": "True",
    "validation_split_ratio": "0.1",
    "max_input_length": "1024",
    "chat_dataset": "False",
}
training_configs = {
    # ml.g5.12xlarge is faster but not currently available in us-west-2
    # "instance_type": "ml.g5.12xlarge",
    # Use ml.g4dn.12xlarge for concurrency testing; it's cheap and slow
    # "instance_type": "ml.g4dn.12xlarge",
    # For a faster loop time, use the instance type below. Note that we only have
    # a quota of 1 instance for this type in us-west-2
    "instance_type": "ml.p3dn.24xlarge"
}

external_model_trainer = ExternalModelTrainer(
    column_mappings=cm, finetuning_provider_type=FinetuningProvider.AWS_SAGEMAKER
)
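
The values in `finetuning_configs` above are all strings. If you prefer to keep typed values in your own code, a small helper can convert them before you build the dict. This is only a sketch: `to_string_config` is hypothetical, and whether the trainer also accepts non-string values is not confirmed here.

```python
def to_string_config(config):
    """Convert every value in a config dict to its string form."""
    return {key: str(value) for key, value in config.items()}


# Typed equivalents of the configuration values shown above
typed_configs = {
    "epoch": 1,
    "instruction_tuned": True,
    "validation_split_ratio": 0.1,
    "max_input_length": 1024,
    "chat_dataset": False,
}

string_configs = to_string_config(typed_configs)
print(string_configs)
```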

#### Curate a fine tuning dataset

You can create a dataset for fine tuning your GenAI application using several methods.

This notebook includes an example that gets high-quality data points from the Spanish-language data slice for your fine tuning dataset. The goal for this fine tuning is to achieve better chatbot responses to Spanish-language questions.

In [None]:
# Create a curated dataset from good examples from the Spanish-language slice
spanish_slice = Slice.get(dataset=app_name, slice="spanish")
df = sf.nodes.get_dataset(
    node=ft_app.model_node_uid,
    gt_label="ACCEPT",
    include_slice_uids=[spanish_slice.slice_uid],
)
good_slice_x_uids = df.index

Another method of curating a dataset for fine tuning is to use a quality model to select the highest-quality data points from the current dataset.

In [None]:
# Uncomment to run the dataset creation job
# # Create a curated dataset from a trained quality model
# # You must train a quality model in Snorkel Flow before running this cell

# qd = ft_app.get_quality_dataset(1) # Use the UID for your end model
# qd_filtered = qd.filter(confidence_threshold=0.9, labels=["ACCEPT"])
# good_x_uids = list(qd_filtered.get_data().index)
# good_datasource_uids = list(qd_filtered.get_data()['datasource_uid'].unique())
# good_datasource_uids = [int(i) for i in good_datasource_uids]
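
Conceptually, the quality model filter keeps datapoints that the model labels as acceptable with high confidence. The sketch below illustrates that selection on plain Python data. It is not the implementation of `qd.filter`: the record layout, the `REJECT` label, and the inclusive threshold comparison are assumptions.

```python
def select_high_quality(predictions, confidence_threshold=0.9, labels=("ACCEPT",)):
    """Keep x_uids whose predicted label is allowed and confident enough."""
    return [
        p["x_uid"]
        for p in predictions
        if p["label"] in labels and p["confidence"] >= confidence_threshold
    ]


# Made-up quality model predictions
predictions = [
    {"x_uid": "doc::0", "label": "ACCEPT", "confidence": 0.97},
    {"x_uid": "doc::1", "label": "ACCEPT", "confidence": 0.62},
    {"x_uid": "doc::2", "label": "REJECT", "confidence": 0.95},
    {"x_uid": "doc::3", "label": "ACCEPT", "confidence": 0.90},
]

selected_x_uids = select_high_quality(predictions)
print(selected_x_uids)
```

Raising the threshold trades dataset size for label quality, so check how many datapoints survive the filter before fine tuning on them.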

#### Run the fine tuning job

Run the cell below when you are ready to kick off the fine tuning job. For the purposes of this example notebook, instead of running your own fine-tuning job, deploying the fine-tuned model for your GenAI application, and collecting more response data, you can use an example dataset.

In [None]:
# Uncomment to run the fine tuning job
# external_model = external_model_trainer.finetune(
#     base_model_id="meta-textgeneration-llama-3-8b-instruct",
#     base_model_version="2.*",
#     finetuning_configs=finetuning_configs,
#     training_configs=training_configs,
#     datasource_uids=good_datasource_uids,
#     # Optional: restrict training to specific datapoints with x_uids
#     x_uids=good_x_uids,
#     # Set sync=False to return a job id and release the notebook kernel
#     sync=True
# )

### Add data from the second experiment to Snorkel Flow

This section shows how to add data from a new generation of your GenAI application to Snorkel Flow so you can create the next evaluation report.

In [None]:
# Upload data from the second generation of your GenAI app
fine_tuned_data = pd.read_csv("./eval-train-2.csv")

# Create a new experiment in Snorkel to track results
# This creates a new checkpoint for the evaluation graph
llama3_8B_v1 = "llama3-8B-v1"
experiment2 = ft_app.register_model_source(
    llama3_8B_v1, metadata=ModelSourceMetadata(model_name=llama3_8B_v1)
)["source_uid"]

ft_app.import_data(
    data=fine_tuned_data, split="train", source_uid=experiment2, sync=True
)

Expected output:

```
Successfully retrieved dataset evaluation-example-app with UID XXXX.
+0.59s Starting fine tuning data ingestion
+1.99s (100%) Data ingestion complete
Data ingestion complete
+0.59s Creating active datasource for op node
+1.29s (1%) Processing data source 1 / 1: starting.
+11.71s (1%) Processing data source 1 / 1: starting.
+21.71s (1%) Processing data source 1 / 1: starting.
+32.07s (1%) Processing data source 1 / 1: starting.
+42.47s (1%) Processing data source 1 / 1: starting.
+52.90s (1%) Processing data source 1 / 1: starting.
+63.44s (1%) Processing data source 1 / 1: starting.
+69.66s (90%) Active datasources preprocessing complete. Committing transactions. This can take several minutes.
+70.35s (100%) Transactions committed.

'rq-UezNqGLj_engine-u428_prep-and-ingest-fine-tuning-data'
```

In [None]:
# Reapply the slices
reapply_slices(ft_app.dataset_uid)

Expected output:

```
Applied 'verbose_question' slice (n=13, 4.21%)
Successfully added 13 datapoints to slice verbose_question.
Applied 'is_spanish' slice (n=9, 2.91%)
Successfully added 9 datapoints to slice spanish.
Retrieving slices for dataset XXXX.
Successfully reapplied slices to the dataset XXXX
```

### Evaluate the second generation of GenAI responses

Create a new evaluation report with the new experiment data.

In [None]:
# Run the evaluation report for the second experiment
ft_app.create_evaluation_report()

Expected output:

```
+0.59s Found and loaded datasources for models llama3-8B-v0, llama3-8B-v1 in dataset XXXX train split.
+10.67s Found and loaded datasources for models llama3-8B-v0, llama3-8B-v1 in dataset XXXX train split.

...

{'dataset_uid': XXXX,

...

     'global': 100.0}}},
  'test': {}}}
```

In [None]:
# Add a description to the latest report
eval_report_uid = get_latest_report_uid(dataset_uid)
update_report_description(dataset_uid, eval_report_uid, "Fine tuning iteration")

Revisit the **Evaluate** page for your application in Snorkel Flow.

Now you should see two evaluation scores for each of your criteria, allowing you to visually identify trends in performance.

You can also toggle the option to **Compare models** to view performance deltas across experiments.
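
If you export the scores, the same comparison can be reproduced outside the dashboard. The sketch below computes per-criterion deltas between two experiments. The scores are made up, and `metric_deltas` is a hypothetical helper, not how the **Compare models** view is implemented.

```python
def metric_deltas(baseline, candidate):
    """Per-criterion change in score (candidate minus baseline), in points."""
    # Only compare criteria that were measured in both experiments
    shared = sorted(set(baseline) & set(candidate))
    return {name: round(candidate[name] - baseline[name], 2) for name in shared}


# Made-up scores (percent accepted) for two experiments
v0_scores = {"GT | Completeness": 62.5, "GT | Polite Tone": 90.0}
v1_scores = {"GT | Completeness": 75.0, "GT | Polite Tone": 87.5}

deltas = metric_deltas(v0_scores, v1_scores)
print(deltas)
```

A negative delta on one criterion alongside gains on another is a signal to inspect that criterion's failures before promoting the new model.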

## Next steps

Keep iterating on your GenAI application until its performance across the evaluation criteria meets your standards for production.