# 8. Scorer Metrics

PyRIT includes an evaluation framework to measure how well automated scorers align with human judgment. This is critical for understanding scorer reliability before using them in production scenarios.

**Why evaluate scorers?**
- Different scorer configurations (prompts, models, temperatures) produce different results
- Metrics help you choose the best configuration for your use case, and help us build better scorers
- Understanding accuracy/precision/recall trade-offs guides how you interpret scorer outputs

**Two types of scorer evaluations:**
1. **Objective scorers** (true/false): Measured with accuracy, precision, recall, F1 score
2. **Harm scorers** (Likert scale 0.0-1.0): Measured with mean absolute error, t-test, Krippendorff's alpha

This notebook covers how to run evaluations, interpret metrics, retrieve cached results, and iterate on scorer configurations.

## Objective Scorer Evaluation

Objective scorers produce true/false outputs, usually indicating whether an objective was met (e.g., did this response include instructions on "how to make a Molotov cocktail"?). We evaluate them using standard classification metrics by comparing model predictions against human-labeled ground truth.

### Understanding Objective Metrics

- **Accuracy**: Proportion of predictions matching human labels. Simple but can be misleading with imbalanced datasets.
- **Precision**: Of all "true" predictions, how many were correct? High precision = few false positives.
- **Recall**: Of all actual "true" cases, how many did we catch? High recall = few false negatives.
- **F1 Score**: Harmonic mean of precision and recall. Balances both concerns.
- **Accuracy Standard Error**: Statistical uncertainty in accuracy estimate, useful for confidence intervals.

**Which metric matters most?**
- If false positives are costly (e.g., flagging safe content as harmful) → prioritize **precision**
- If false negatives are costly (e.g., missing actual jailbreaks) → prioritize **recall**
- For balanced scenarios → use **F1 score**
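The metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not PyRIT's implementation: the function name `classification_metrics` and the sample labels are hypothetical, and the standard error uses the usual binomial formula `sqrt(p * (1 - p) / n)`, which is an assumption about how the accuracy standard error is defined.

```python
import math


def classification_metrics(predictions, labels):
    """Compare boolean scorer predictions against boolean human labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)          # true positives
    fp = sum(1 for p, l in pairs if p and not l)      # false positives
    fn = sum(1 for p, l in pairs if not p and l)      # false negatives
    tn = sum(1 for p, l in pairs if not p and not l)  # true negatives

    n = len(pairs)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Assumed definition: binomial standard error of the accuracy estimate
    accuracy_se = math.sqrt(accuracy * (1 - accuracy) / n)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "accuracy_standard_error": accuracy_se,
    }


# Hypothetical data: 8 responses, 3 of which humans labeled "true"
human_labels = [True, True, True, False, False, False, False, False]
scorer_preds = [True, True, False, True, False, False, False, False]

result = classification_metrics(scorer_preds, human_labels)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With one false positive and one false negative in this sample, precision and recall are equal, so the F1 score matches both. An approximate 95% confidence interval for accuracy would be `accuracy ± 1.96 * accuracy_standard_error`, which is wide for a dataset this small.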

### Running an Objective Evaluation

Call `evaluate_async()` on any scorer instance. The scorer's identity (including system prompt, model, temperature) determines which cached results apply.

In [1]:

from typing import cast

from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer
from pyrit.score.scorer_evaluation.scorer_evaluator import ScorerEvalDatasetFiles
from pyrit.score.scorer_evaluation.scorer_metrics import ObjectiveScorerMetrics
from pyrit.setup import IN_MEMORY, initialize_pyrit_async

await initialize_pyrit_async(memory_db_type=IN_MEMORY)

# Create a refusal scorer - uses the chat target to determine if responses are refusals
refusal_scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())


# REAL usage would simply be the next line
# but we're configuring a smaller eval for demo purposes
# metrics = await refusal_scorer.evaluate_async()

# For demonstration, use a smaller evaluation file (normally you'd use the full dataset)
# The evaluation_file_mapping tells the evaluator which human-labeled CSV files to use
refusal_scorer.evaluation_file_mapping = ScorerEvalDatasetFiles(
    human_labeled_datasets_files=["refusal_scorer/mini_refusal.csv"],
    result_file="refusal_scorer/test_refusal_evaluation_results.jsonl",
)
# Run evaluation with:
# - num_scorer_trials=1: Score each response once (use 3+ for production to measure variance)
# - add_to_evaluation_results=False: Don't save to the official registry (for testing only)
metrics = await refusal_scorer.evaluate_async(num_scorer_trials=1, add_to_evaluation_results=False)

if metrics:
    objective_metrics = cast(ObjectiveScorerMetrics, metrics)
    print("Evaluation Metrics:")
    print(f"Dataset Name: {objective_metrics.dataset_name}")
    print(f"Dataset Version: {objective_metrics.dataset_version}")
    print(f"F1 Score: {objective_metrics.f1_score}")
    print(f"Accuracy: {objective_metrics.accuracy}")
    print(f"Precision: {objective_metrics.precision}")
    print(f"Recall: {objective_metrics.recall}")
else:
    raise RuntimeError("Evaluation failed, no metrics returned")

Evaluation Metrics:
Dataset Name: refusal_scorer/test_refusal_evaluation_results.jsonl
Dataset Version: 1.0
F1 Score: 1.0
Accuracy: 1.0
Precision: 1.0
Recall: 1.0


## Harm Scorer Evaluation

Harm scorers produce float scores (0.0-1.0) representing severity. Since these are continuous values, we use different metrics that capture how close the model's scores are to human judgments.

### Understanding Harm Metrics

**Error Metrics:**
- **Mean Absolute Error (MAE)**: Average absolute difference between model and human scores. On the 0.0-1.0 scale, an MAE of 0.15 means the model is off by 0.15 on average, i.e. about 15% of the full score range.
- **MAE Standard Error**: Uncertainty in the MAE estimate.

**Statistical Significance:**
- **t-statistic**: From a one-sample t-test. Positive = model scores higher than humans; negative = lower.
- **p-value**: If < 0.05, the difference between model and human scores is statistically significant, meaning it is unlikely to be explained by chance alone.
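To make the error and significance metrics concrete, here is a minimal sketch using only the standard library. The function name `harm_error_metrics` and the sample scores are hypothetical, and the formulas (standard error as sample standard deviation over `sqrt(n)`, a one-sample t-test of the model-minus-human differences against zero) are assumptions about the definitions rather than PyRIT's exact code.

```python
import math
import statistics


def harm_error_metrics(model_scores, human_scores):
    """Compute MAE, its standard error, and a one-sample t-statistic."""
    if len(model_scores) != len(human_scores) or len(model_scores) < 2:
        raise ValueError("need at least two paired scores")

    n = len(model_scores)
    diffs = [m - h for m, h in zip(model_scores, human_scores)]
    abs_errors = [abs(d) for d in diffs]

    mae = statistics.fmean(abs_errors)
    mae_se = statistics.stdev(abs_errors) / math.sqrt(n)

    # t-statistic for H0: mean(model - human) == 0.
    # Raises ZeroDivisionError if every difference is identical.
    mean_diff = statistics.fmean(diffs)
    t_stat = mean_diff / (statistics.stdev(diffs) / math.sqrt(n))

    return {"mae": mae, "mae_standard_error": mae_se, "t_statistic": t_stat}


# Hypothetical scores on the 0.0-1.0 scale
human = [0.0, 0.25, 0.5, 0.75, 1.0]
unbiased_model = [0.25, 0.25, 0.5, 0.5, 1.0]  # errors in both directions
biased_model = [0.25, 0.5, 0.5, 1.0, 1.0]     # never below the human score

unbiased = harm_error_metrics(unbiased_model, human)
biased = harm_error_metrics(biased_model, human)
print(unbiased)
print(biased)
```

The first model has a nonzero MAE but a t-statistic of zero because its errors cancel out, while the second scores consistently higher than the humans and gets a positive t-statistic. MAE measures the size of the errors and the t-test measures their direction. Converting the t-statistic to a p-value requires the t-distribution with `n - 1` degrees of freedom; if SciPy is available, `scipy.stats.ttest_1samp(diffs, 0.0)` returns both.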

**Inter-Rater Reliability (Krippendorff's Alpha):**
Measures agreement between evaluators, ranging from -1.0 to 1.0:
- **1.0**: Perfect agreement
- **0.8+**: Strong agreement (typically acceptable for production)
- **0.6-0.8**: Moderate agreement
- **< 0.6**: Weak agreement (scorer may need tuning)

Three alpha values are reported:
- **`krippendorff_alpha_humans`**: Agreement among human evaluators (baseline quality of labels)
- **`krippendorff_alpha_model`**: Agreement across multiple scoring trials (model consistency)
- **`krippendorff_alpha_combined`**: Overall agreement between humans and model
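For intuition about what alpha measures, the sketch below computes Krippendorff's alpha with the interval (squared-difference) distance, as `1 - D_o / D_e`, where `D_o` is the observed disagreement within items and `D_e` is the disagreement expected if ratings were shuffled across items. The function name and data are hypothetical, and PyRIT may compute alpha differently (for example with another distance metric or a library), so treat this as an illustration only.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    `units` is a list of lists; each inner list holds every rating
    (from any rater) given to one item. Items with fewer than two
    ratings cannot be paired and are ignored.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: squared differences between ratings of the same item
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all ratings, ignoring items
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical; alpha is undefined")

    return 1.0 - d_o / d_e


# Hypothetical ratings: each inner list is [rater_1, rater_2] for one item
perfect = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
partial = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

alpha_perfect = krippendorff_alpha_interval(perfect)
alpha_partial = krippendorff_alpha_interval(partial)
print(f"perfect agreement: {alpha_perfect:.3f}")
print(f"one disputed item: {alpha_partial:.3f}")
```

A single fully disputed item out of three pulls alpha well below the 0.6 threshold here, which shows how sensitive the statistic is on small datasets.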

In [None]:
from pyrit.score import SelfAskLikertScorer, LikertScalePaths

# Create a harm scorer using the hate speech Likert scale
likert_scorer = SelfAskLikertScorer(
    chat_target=OpenAIChatTarget(),
    likert_scale_path=LikertScalePaths.HATE_SPEECH_SCALE.value
)

# Configure evaluation to use a small sample dataset
likert_scorer.evaluation_file_mapping = ScorerEvalDatasetFiles(
    human_labeled_datasets_files=["harm/mini_hate_speech.csv"],
    result_file="harm/test_hate_speech_evaluation_results.jsonl",
    harm_category="hate_speech",  # Required for harm evaluations
)

harm_results = await likert_scorer.evaluate_async(num_scorer_trials=1, add_to_evaluation_results=False)

for name, metrics in harm_results.items():
    print(f"Dataset: {name}")
    print(f"  MAE:                {metrics.mean_absolute_error:.3f}")
    print(f"  MAE Std Error:      {metrics.mae_standard_error:.4f}")
    print(f"  t-statistic:        {metrics.t_statistic:.3f}")
    print(f"  p-value:            {metrics.p_value:.4f}")
    print(f"  Krippendorff Alpha: {metrics.krippendorff_alpha_combined:.3f}")

Dataset: test_hate_speech_evaluation_results
  MAE:                0.167
  MAE Std Error:      0.0314
  t-statistic:        -0.612
  p-value:            0.5501
  Krippendorff Alpha: 0.678


## Retrieving Cached Metrics

Once scorer metrics are calculated with `evaluate_async()`, they're saved to JSONL registry files and can be retrieved without re-running the evaluation. The PyRIT team has pre-computed metrics for common scorer configurations which you can access immediately.

Use `get_scorer_metrics()` to retrieve cached results for a scorer configuration:

In [3]:
# Create a new scorer instance with the same configuration
# get_scorer_metrics() looks up results by scorer identity hash
refusal_scorer_for_lookup = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

# Retrieve pre-computed metrics (from PyRIT team's evaluation runs)
cached_metrics = refusal_scorer_for_lookup.get_scorer_metrics()

if cached_metrics:
    for dataset_name, metrics in cached_metrics.items():
        print(f"Cached results for: {dataset_name}")
        print(f"  Accuracy:  {metrics.accuracy:.2%}")
        print(f"  F1 Score:  {metrics.f1_score:.3f}")
        print(f"  Precision: {metrics.precision:.3f}")
        print(f"  Recall:    {metrics.recall:.3f}")
else:
    print("No cached metrics found for this scorer configuration.")

Cached results for: refusal_evaluation_results
  Accuracy:  100.00%
  F1 Score:  1.000
  Precision: 1.000
  Recall:    1.000


## Scorer Identity and Caching

Evaluation results are cached by a hash of the scorer's full identity, which includes:
- Scorer type (e.g., `SelfAskRefusalScorer`)
- System and user prompt templates
- Target model information (endpoint, model name)
- Temperature and other generation parameters
- Any scorer-specific configuration

This means changing *any* of these values creates a new scorer identity, which requires a fresh evaluation. These fields are part of the identity because each of them _might_ change scorer performance. Does changing the temperature increase or decrease accuracy? Tracking each configuration separately lets you experiment and compare.

In [None]:
# View the scorer's full identity - this determines the metrics retrieved using get_scorer_metrics()
scorer_identity = refusal_scorer.scorer_identifier
print("Scorer Identity:")
print(f"  Type: {scorer_identity.scorer_type}")
print(f"  System Prompt Hash: {scorer_identity.system_prompt_template[:50] if scorer_identity.system_prompt_template else 'None'}...")
print(f"  Target Info: {scorer_identity.target_info}")
print(f"  Identity Hash: {scorer_identity.compute_hash()}")
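The general idea of identity-based caching can be sketched independently of PyRIT. The example below is a simplified stand-in, not the library's actual hashing scheme: the `identity_hash` function, the field names, and the `example-model` value are all hypothetical. It hashes a canonical JSON form of the configuration, so any changed value produces a different cache key.

```python
import hashlib
import json


def identity_hash(config: dict) -> str:
    """Hash a scorer configuration in a key-order-independent way."""
    # sort_keys gives a canonical form, so key order does not affect the hash
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configuration fields, for illustration only
base = {
    "scorer_type": "SelfAskRefusalScorer",
    "model_name": "example-model",
    "temperature": 0.0,
}
warmer = {**base, "temperature": 0.9}           # one value changed
reordered = dict(reversed(list(base.items())))  # same values, different key order

print(identity_hash(base))
print(identity_hash(warmer))
print(identity_hash(base) == identity_hash(reordered))
```

Changing only the temperature yields a different hash, so cached metrics for the original configuration would not be returned for the modified one.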

## Comparing Scorer Configurations

The evaluation registry stores metrics for all tested scorer configurations. You can load all entries to compare which configurations perform best for different metrics.

Use `get_all_objective_metrics()` or `get_all_harm_metrics()` to load evaluation results. These return properly typed `ScorerMetricsWithIdentity` objects with clean attribute access to both the scorer identity and its metrics.

In [2]:
from pyrit.score import get_all_objective_metrics

# Load all objective scorer metrics - returns ScorerMetricsWithIdentity[ObjectiveScorerMetrics]
all_scorers = get_all_objective_metrics()

print(f"Found {len(all_scorers)} scorer configurations in the registry\n")

# Sort by F1 score - type checker knows entry.metrics is ObjectiveScorerMetrics
sorted_by_f1 = sorted(all_scorers, key=lambda x: x.metrics.f1_score, reverse=True)

print("Top 5 configurations by F1 Score:")
print("-" * 80)
for i, entry in enumerate(sorted_by_f1[:5], 1):
    m = entry.metrics  # ObjectiveScorerMetrics - no cast needed!
    print(f"\n{i}. F1={m.f1_score:.3f}  Accuracy={m.accuracy:.2%}  Precision={m.precision:.3f}  Recall={m.recall:.3f}")
    print(f"   Scorer Type: {entry.scorer_identifier.type}")

# Show best by each metric
print("\n" + "=" * 80)
print(f"Best Accuracy:  {max(all_scorers, key=lambda x: x.metrics.accuracy).metrics.accuracy:.2%}")
print(f"Best Precision: {max(all_scorers, key=lambda x: x.metrics.precision).metrics.precision:.3f}")
print(f"Best Recall:    {max(all_scorers, key=lambda x: x.metrics.recall).metrics.recall:.3f}")

Found 6 scorer configurations in the registry

Top 5 configurations by F1 Score:
--------------------------------------------------------------------------------

1. F1=0.814  Accuracy=78.38%  Precision=0.714  Recall=0.946
   Scorer: {'__type__': 'TrueFalseInverterScorer', 'system_prompt_template': None, 'user_prompt_template': None, 'sub_identifier': [{'__type__': 'SelfAskRefusalScorer', 'system_prompt_template': 'sha256:8e8ddfa0aa41578a', 'user_prompt_template': None, 'sub_identifier': None, 'target_info': {'__type__': 'OpenAIChatTarget', 'model_name': 'gpt-4o-unsafe', 'temperature': 0.9}, 'score_aggregator': 'OR_', 'scorer_specific_params': None}], 'target_info': None, 'score_aggregator': 'OR_', 'scorer_specific_params': None, 'hash': 'a13631702f000907fa0b7b6a0f62b05aed9effe291674448a958b34658893575'}

2. F1=0.802  Accuracy=77.03%  Precision=0.704  Recall=0.932
   Scorer: {'__type__': 'TrueFalseInverterScorer', 'system_prompt_template': None, 'user_prompt_template': None, 'sub_ident

## Custom Evaluation Datasets

You can evaluate scorers against your own human-labeled datasets using `ScorerEvalDatasetFiles`:

```python
# Configure a custom dataset (assigned as a single ScorerEvalDatasetFiles
# object, matching the notebook cells earlier in this document)
scorer.evaluation_file_mapping = ScorerEvalDatasetFiles(
    # Glob patterns relative to SCORER_EVALS_PATH
    human_labeled_datasets_files=["my_datasets/*.csv"],
    # Where to store results
    result_file="my_datasets/evaluation_results.jsonl",
    # Required for harm scorers, ignored for objective scorers
    harm_category="violence",
)
```

### CSV Format

**Columns:**
- `assistant_response`: Required. The text to score
- `objective`: Required for **objective scorers** - the scoring question (e.g., "Did the response refuse?")
- `harm_category`: Required for **harm scorers** - the harm category (e.g., "hate_speech", "violence")
- `human_score` or `human_score_1`, `human_score_2`, etc.: Required. Ground truth labels from human raters
  - For objective scorers: 0 or 1 (converted to bool)
  - For harm scorers: 0.0-1.0 float values
- `data_type`: Optional. Type of content (defaults to "text")
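As an illustration of the layout, the sketch below writes a small objective-scorer dataset with two human raters and reads it back using Python's `csv` module. The rows are made up, the data is kept in memory rather than written under `SCORER_EVALS_PATH`, and the parsing is a simplified stand-in: how PyRIT itself loads the file and combines multiple raters is not shown here.

```python
import csv
import io

# Hypothetical rows for an objective (true/false) scorer with two human raters
rows = [
    {
        "assistant_response": "I can't help with that request.",
        "objective": "Did the response refuse?",
        "human_score_1": "1",
        "human_score_2": "1",
        "data_type": "text",
    },
    {
        "assistant_response": "Sure, here is an overview of the topic.",
        "objective": "Did the response refuse?",
        "human_score_1": "0",
        "human_score_2": "0",
        "data_type": "text",
    },
]

# Write the CSV to an in-memory buffer (a real dataset would be a .csv file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read it back and convert each 0/1 human score to a bool
buffer.seek(0)
parsed = []
for row in csv.DictReader(buffer):
    score_columns = sorted(key for key in row if key.startswith("human_score"))
    labels = [bool(int(row[key])) for key in score_columns]
    parsed.append((row["assistant_response"], labels))

for response, labels in parsed:
    print(f"{labels} <- {response}")
```

For a harm scorer, the same structure would carry a `harm_category` column instead of `objective`, with float scores between 0.0 and 1.0 in the `human_score` columns.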