# 8. Scorer Metrics

PyRIT includes metrics for scorers so you can evaluate how well different scorer configurations perform. The metrics are calculated by comparing automated scorer outputs against human-labeled datasets. These numbers are shown automatically as part of Scenario output. This notebook covers how to run new scorer evaluations and how to retrieve existing metrics when they are available.

In [1]:
# Setup
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer
from pyrit.setup import IN_MEMORY, initialize_pyrit_async

await initialize_pyrit_async(memory_db_type=IN_MEMORY)
target = OpenAIChatTarget()

Found default environment files: ['C:\\Users\\rlundeen\\.pyrit\\.env', 'C:\\Users\\rlundeen\\.pyrit\\.env.local']
Loaded environment file: C:\Users\rlundeen\.pyrit\.env
Loaded environment file: C:\Users\rlundeen\.pyrit\.env.local


## Running a Scorer Evaluation

The simplest way to evaluate a scorer is to call `evaluate_async()` on the scorer instance. Several parts of a scorer's identity can influence its accuracy. They are captured in the following fields of the `ScorerIdentifier` class:

```python
    type: str
    system_prompt_template: Optional[str] = None
    user_prompt_template: Optional[str] = None
    sub_identifier: Optional[List[ScorerIdentifier]] = None
    target_info: Optional[Dict[str, Any]] = None
    score_aggregator: Optional[str] = None
    scorer_specific_params: Optional[Dict[str, Any]] = None
```
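To illustrate why these fields matter, here is a minimal sketch of how a scorer configuration could be reduced to a stable key, so that metrics from one configuration are not confused with those of another. The `SketchScorerIdentifier` class and its `config_key` method are hypothetical stand-ins written for this example; they mirror the field list above but are not PyRIT's implementation.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SketchScorerIdentifier:
    """Hypothetical stand-in that mirrors the fields listed above."""

    type: str
    system_prompt_template: Optional[str] = None
    user_prompt_template: Optional[str] = None
    sub_identifier: Optional[List["SketchScorerIdentifier"]] = None
    target_info: Optional[Dict[str, Any]] = None
    score_aggregator: Optional[str] = None
    scorer_specific_params: Optional[Dict[str, Any]] = None

    def config_key(self) -> str:
        # Serialize every field deterministically, then hash the result.
        # This is an assumed scheme for illustration only.
        payload = json.dumps(asdict(self), sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = SketchScorerIdentifier(type="SelfAskRefusalScorer", system_prompt_template="prompt v1")
same = SketchScorerIdentifier(type="SelfAskRefusalScorer", system_prompt_template="prompt v1")
changed = SketchScorerIdentifier(type="SelfAskRefusalScorer", system_prompt_template="prompt v2")

# Identical configurations share a key; changing any field yields a different key.
print(base.config_key() == same.config_key())
print(base.config_key() == changed.config_key())
```

Under this assumed scheme, editing only the system prompt produces a different key, which is the reason evaluation results need to be tracked per configuration rather than per scorer class.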

A `Scorer` also has an optional `evaluation_file_mapping` attribute. It configures which human-labeled datasets (and, optionally, harm categories) the `Scorer` is evaluated against. Below is a snippet that runs an evaluation.

In [None]:
# Create and evaluate a refusal scorer
from pyrit.score.scorer_evaluation.scorer_evaluator import ScorerEvalDatasetFiles


refusal_scorer = SelfAskRefusalScorer(chat_target=target)

# In real usage, run the evaluation with the scorer's default evaluation_file_mapping
# (the whole dataset) and save the results:
# results = await refusal_scorer.evaluate_async()

# For demonstration, use a smaller evaluation file mapping, run only one trial,
# and do not add to the saved evaluation results.

refusal_scorer.evaluation_file_mapping = [
    ScorerEvalDatasetFiles(
        human_labeled_datasets_files=["refusal_scorer/mini_refusal.csv"],
        result_file="refusal_scorer/test_refusal_evaluation_results.jsonl",
    )
]

results = await refusal_scorer.evaluate_async(num_scorer_trials=1, add_to_evaluation_results=False)

In [16]:
# Results is a dict mapping dataset name to metrics
for name, metrics in results.items():
    print(f"Dataset: {name}")
    print(f"  Accuracy: {metrics.accuracy:.2%}")
    print(f"  Precision: {metrics.precision:.3f}")
    print(f"  Recall: {metrics.recall:.3f}")
    print(f"  F1 Score: {metrics.f1_score:.3f}")

Dataset: refusal_evaluation_results
  Accuracy: 100.00%
  Precision: 1.000
  Recall: 1.000
  F1 Score: 1.000


## Retrieving Scorer Metrics

Once metrics for a given scorer are calculated with `evaluate_async`, they are saved and checked in. Many evaluations have already been run and checked in; see `build_scripts\evaluate_scorers.py`. You can retrieve the saved metrics for any scorer using `get_scorer_metrics`; these are the same numbers you see when scenarios are run.

In [3]:
# Note: these are not from the sample run above; they are retrieved from a full run that is checked in.

refusal_scorer2 = SelfAskRefusalScorer(chat_target=target)
metrics = refusal_scorer2.get_scorer_metrics()

print(metrics)

{'refusal_evaluation_results': ObjectiveScorerMetrics(num_responses=174, num_human_raters=1, num_scorer_trials=1, dataset_name='refusal_scorer/refusal_evaluation_results.jsonl', dataset_version='1.0', trial_scores=None, accuracy=1.0, accuracy_standard_error=0.0, f1_score=1.0, precision=1.0, recall=1.0)}


## Understanding the Metrics

For objective (true/false) scorers, the metrics are:

- **Accuracy**: Proportion of correct predictions
- **Precision**: Of all positive predictions, how many were correct
- **Recall**: Of all actual positives, how many were detected
- **F1 Score**: Harmonic mean of precision and recall
- **Accuracy Standard Error**: Statistical uncertainty in the accuracy estimate
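
The definitions above can be sketched in plain Python. The helper `objective_metrics` below is a hypothetical function written for this illustration, not part of PyRIT. It also assumes the standard error is the usual binomial estimate `sqrt(p * (1 - p) / n)`, which may differ from how PyRIT computes it.

```python
import math
from typing import Dict, Sequence


def objective_metrics(human_labels: Sequence[bool], scorer_labels: Sequence[bool]) -> Dict[str, float]:
    """Compare scorer predictions against human labels (hypothetical helper)."""
    if len(human_labels) != len(scorer_labels) or len(human_labels) == 0:
        raise ValueError("Label sequences must be non-empty and the same length")

    pairs = list(zip(human_labels, scorer_labels))
    tp = sum(1 for h, s in pairs if h and s)          # true positives
    fp = sum(1 for h, s in pairs if not h and s)      # false positives
    fn = sum(1 for h, s in pairs if h and not s)      # false negatives
    tn = sum(1 for h, s in pairs if not h and not s)  # true negatives
    n = len(pairs)

    accuracy = (tp + tn) / n
    # Fall back to 0.0 when a denominator is zero (a convention chosen for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Assumed binomial standard error of the accuracy estimate.
    accuracy_standard_error = math.sqrt(accuracy * (1 - accuracy) / n)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
        "accuracy_standard_error": accuracy_standard_error,
    }


human = [True, True, False, False, True]
scorer = [True, False, False, True, True]
example = objective_metrics(human, scorer)
for key, value in example.items():
    print(f"{key}: {value:.4f}")
```

In this toy example the scorer agrees with the human labels on 3 of 5 responses, so accuracy is 0.6, while precision, recall, and F1 are each 2/3 (two true positives, one false positive, one false negative).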

In [None]:
from dataclasses import asdict

# View all metrics for the first result
if results:
    first_metrics = list(results.values())[0]
    # Exclude trial_scores for cleaner output
    metrics_dict = {k: v for k, v in asdict(first_metrics).items() if k != 'trial_scores'}
    
    print("All metrics:")
    for key, value in metrics_dict.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
        else:
            print(f"  {key}: {value}")

## Checking Existing Metrics from the Registry

The PyRIT team maintains a registry of scorer evaluation results on official datasets.
You can check if your scorer configuration has been evaluated:

In [None]:
# code that retrieves all objective scorer metrics and finds the top accuracy scorer