# A Brief Introduction to probly’s Unified OOD Evaluation API

This notebook introduces the **unified out-of-distribution (OOD) evaluation API** provided by `probly`. It explains the motivation, core concepts, and advanced usage patterns behind the `evaluate_ood` function, and shows how it enables **clean, extensible, and backward-compatible** OOD evaluation.

---

## 0. Motivation

Evaluating out-of-distribution detection methods typically involves multiple metrics:

* AUROC
* AUPR
* FPR or FNR at a fixed TPR

In practice, these metrics are often exposed via **separate functions**, with different calling conventions and limited extensibility. This leads to brittle evaluation code and makes experimentation harder.

The goal of probly’s unified OOD evaluation API is to:

* Provide **one entry point** for all common OOD metrics
* Preserve **backward compatibility** with existing AUROC-based workflows
* Allow **easy extension** with new metrics and thresholds
* Support **configuration-friendly metric specifications** (e.g. strings)

---

## 1. Hello World: AUROC (Backward Compatible)

We start with the simplest possible use case: computing AUROC.

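```python
import numpy as np

from probly.evaluation.ood_api import evaluate_ood

# Synthetic scores: ID scores centered at 0, OOD scores centered at 2
in_scores = np.random.normal(0, 1, size=1000)
out_scores = np.random.normal(2, 1, size=1000)

# With no `metrics` argument, a single AUROC value is returned
auroc = evaluate_ood(in_scores, out_scores)
print("AUROC:", auroc)
```

```
AUROC: 0.925721
```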


### What is happening here?

* `metrics=None` is the default and selects AUROC only
* The function returns a **single float**
* Existing code that expects `evaluate_ood(...) -> float` continues to work unchanged

This ensures **full backward compatibility**.
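
The return-type contract described above can be illustrated with a small stand-in. This is a minimal sketch, not probly's implementation: `auroc_sketch` and `evaluate_sketch` are hypothetical names, and the AUROC here is a plain NumPy pairwise comparison that assumes higher scores mean "more OOD".

```python
import numpy as np


def auroc_sketch(in_scores, out_scores):
    """Probability that a random OOD score exceeds a random ID score (ties count half)."""
    in_scores = np.asarray(in_scores, dtype=float)
    out_scores = np.asarray(out_scores, dtype=float)
    # Compare every OOD score with every ID score (fine for small arrays).
    diff = out_scores[:, None] - in_scores[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def evaluate_sketch(in_scores, out_scores, metrics=None):
    """Hypothetical stand-in for the return-type contract, not probly's code."""
    if metrics is None:
        # Legacy call style: a bare float.
        return auroc_sketch(in_scores, out_scores)
    # Explicit request: a dict keyed by metric name.
    available = {"auroc": auroc_sketch}
    return {name: available[name](in_scores, out_scores) for name in metrics}


rng = np.random.default_rng(0)
in_scores = rng.normal(0.0, 1.0, size=200)
out_scores = rng.normal(2.0, 1.0, size=200)

legacy = evaluate_sketch(in_scores, out_scores)
modern = evaluate_sketch(in_scores, out_scores, metrics=["auroc"])
print(type(legacy).__name__, sorted(modern))
```

Keeping the bare-float path behind `metrics=None` means callers written against the old AUROC-only signature never see a dictionary.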

---

## 2. Requesting Multiple Metrics

OOD evaluation rarely relies on a single metric. The unified API allows requesting multiple metrics at once.

```python
# Request several metrics at once; the result is a dict keyed by metric name
results = evaluate_ood(
    in_scores,
    out_scores,
    metrics=["auroc", "aupr"],
)

for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

```
auroc: 0.9218
aupr: 0.9212
```


### Design choice

* Input: list of metric specifications
* Output: dictionary mapping metric name → value
* Uniform interface for all metrics

---

## 3. Exploring All Available Metrics

For exploratory analysis or benchmarking, it is often useful to compute *all* supported metrics.

```python
# The string "all" requests every supported metric
results = evaluate_ood(
    in_scores,
    out_scores,
    metrics="all",
)

results
```

```
{'auroc': 0.921818, 'aupr': 0.9212099095167594}
```

In the run shown above, the dictionary holds only `auroc` and `aupr`. The full set of metrics targeted by the API is:

* AUROC
* AUPR
* FPR @ 95% TPR
* FNR @ 95% TPR

The exact set can grow over time without breaking the API.

---

## 4. Metrics at a Glance

**AUROC** measures how well the score separates ID from OOD samples independent of a fixed threshold. Values close to 1 indicate strong separation, while 0.5 corresponds to random guessing.

**AUPR** is especially useful when OOD samples are rare, as it emphasizes whether high scores truly correspond to OOD examples. Values close to 1 and clearly above the OOD base rate indicate clean and reliable alarms.

**FPR@XTPR (e.g. FPR@95TPR)** quantifies how many ID samples are falsely flagged as OOD when a desired OOD detection rate is enforced. Lower values mean fewer false alarms on ID data.

**FNR@XTPR (e.g. FNR@95TPR)** measures how many OOD samples are still missed at a given detection level. Lower values indicate that most OOD cases are successfully detected.
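
To make the FPR@XTPR definition concrete, here is a minimal NumPy sketch. It is not probly's implementation: `fpr_at_tpr_sketch` is a hypothetical helper, and it assumes that OOD is the positive class and that higher scores mean "more OOD".

```python
import numpy as np


def fpr_at_tpr_sketch(in_scores, out_scores, tpr=0.95):
    """Fraction of ID samples flagged at the threshold that catches at least `tpr` of OOD.

    Assumption: higher score = more OOD, and OOD is the positive class.
    """
    if not 0.0 < tpr <= 1.0:
        raise ValueError(f"tpr must be in (0, 1], got {tpr}")
    in_scores = np.asarray(in_scores, dtype=float)
    out_sorted = np.sort(np.asarray(out_scores, dtype=float))
    n_out = out_sorted.size
    # Number of OOD samples that must score at or above the threshold.
    k = int(np.ceil(tpr * n_out))
    # The k-th highest OOD score is the largest threshold that still catches k samples.
    threshold = out_sorted[n_out - k]
    return float(np.mean(in_scores >= threshold))


id_scores = [0.0, 1.0, 2.0, 3.0]
ood_scores = [2.5, 5.0, 6.0, 7.0]

print(fpr_at_tpr_sketch(id_scores, ood_scores, tpr=1.0))   # 0.25
print(fpr_at_tpr_sketch(id_scores, ood_scores, tpr=0.75))  # 0.0
```

Catching every OOD sample forces the threshold down to 2.5, which flags one of the four ID samples. Relaxing the target to 75% lets the threshold rise to 5.0, above all ID scores.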

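Threshold-based metrics are requested with string specifications of the form `name@value`:

| Specification | Meaning |
|--------------|--------|
| `fpr` | FPR at default TPR (95%) |
| `fpr@0.8` | FPR at 80% TPR |
| `fnr@95%` | FNR at 95% TPR |

String-based specifications make the API: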

* Easy to use from configuration files
* Friendly to CLI tools
* Self-documenting
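
As an illustration of the configuration-file use case, the sketch below reads a metric list from a JSON document. The config layout and key names (`experiment`, `metrics`) are assumptions made for this example, not a format defined by probly.

```python
import json

# Hypothetical experiment config; the key names are illustrative only.
config_text = """
{
  "experiment": "ood-baseline",
  "metrics": ["auroc", "aupr", "fpr@0.8", "fnr@95%"]
}
"""

config = json.loads(config_text)
metric_specs = config["metrics"]

# The plain list of strings can be handed to the evaluation call as-is, e.g.
#   evaluate_ood(in_scores, out_scores, metrics=metric_specs)
for spec in metric_specs:
    print(spec)
```

Because every metric is a plain string, the same list works unchanged in JSON, YAML, or a command-line argument.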

---

## 5. Static vs. Dynamic Metrics

Internally, the API distinguishes between **static** and **dynamic** metrics.

```python
from probly.evaluation.tasks import (
    out_of_distribution_detection_aupr,
    out_of_distribution_detection_auroc,
    out_of_distribution_detection_fnr_at_x_tpr,
    out_of_distribution_detection_fpr_at_x_tpr,
)

# Static metrics: computed from the scores alone
STATIC_METRICS = {
    "auroc": out_of_distribution_detection_auroc,
    "aupr": out_of_distribution_detection_aupr,
}

# Dynamic metrics: additionally take a threshold (the target TPR)
DYNAMIC_METRICS = {
    "fpr": out_of_distribution_detection_fpr_at_x_tpr,
    "fnr": out_of_distribution_detection_fnr_at_x_tpr,
}
```

### Why this separation?

* Static metrics depend only on scores
* Dynamic metrics additionally depend on a threshold
* New metrics can be added by extending these dictionaries

No changes to `evaluate_ood` itself are required.
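
The registry pattern can be sketched in isolation. Everything below is hypothetical: the registries, the dispatcher `compute`, and the two toy metrics are stand-ins chosen for illustration and are not part of probly.

```python
import numpy as np

# Hypothetical local registries mirroring the static/dynamic split above.
# They are NOT probly's dictionaries, and the metrics are toy stand-ins.
static_registry = {}
dynamic_registry = {}


def compute(name, in_scores, out_scores, value=None):
    """Dispatch by registry lookup; this function never changes when metrics are added."""
    if name in static_registry:
        return static_registry[name](in_scores, out_scores)
    if name in dynamic_registry:
        return dynamic_registry[name](in_scores, out_scores, value)
    raise ValueError(f"Unknown metric {name!r}")


def mean_gap(in_scores, out_scores):
    """Toy static metric: mean OOD score minus mean ID score."""
    return float(np.mean(out_scores) - np.mean(in_scores))


def ood_above_id_quantile(in_scores, out_scores, q):
    """Toy dynamic metric: fraction of OOD scores above the q-quantile of ID scores."""
    return float(np.mean(np.asarray(out_scores) > np.quantile(in_scores, q)))


# Adding a metric is a single dictionary entry.
static_registry["mean_gap"] = mean_gap
dynamic_registry["above_q"] = ood_above_id_quantile

id_scores = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ood_scores = np.array([3.0, 5.0, 6.0, 7.0])

print(compute("mean_gap", id_scores, ood_scores))            # 3.25
print(compute("above_q", id_scores, ood_scores, value=0.5))  # 1.0
```

The dispatcher only looks names up, so a new metric needs a function and a registry entry and nothing else.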

---

## 6. Parsing Metric Specifications

Dynamic metric specifications are split into a base name and a threshold by a small helper function, `parse_dynamic_metric`.

```python
from probly.evaluation.ood_api import parse_dynamic_metric

# "fnr@90%" splits into the base metric name and the threshold as a fraction
base, threshold = parse_dynamic_metric("fnr@90%")
print(base, threshold)
```

```
fnr 0.9
```


The parser supports:

* Percent values (`90%`)
* Floating point values (`0.9`)
* Default thresholds when omitted
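
The three cases listed above can be handled in a few lines. The following is a minimal sketch of such a parser, not the source of `parse_dynamic_metric`: the name `parse_spec_sketch`, the default of 0.95, and the accepted range `(0, 1]` are assumptions made for this example.

```python
DEFAULT_TPR = 0.95  # assumed default, matching the 95% default TPR mentioned earlier


def parse_spec_sketch(spec, default=DEFAULT_TPR):
    """Split 'name@value' into (name, threshold), with the threshold as a fraction."""
    name, sep, raw = spec.strip().lower().partition("@")
    if not name:
        raise ValueError(f"Missing metric name in {spec!r}")
    if not sep:
        # No '@' present: fall back to the documented default threshold.
        return name, default
    raw = raw.strip()
    try:
        # '90%' becomes 0.9; '0.9' is taken as-is.
        value = float(raw[:-1]) / 100.0 if raw.endswith("%") else float(raw)
    except ValueError:
        raise ValueError(f"Cannot parse threshold in {spec!r}") from None
    if not 0.0 < value <= 1.0:
        raise ValueError(f"Threshold must be in (0, 1], got {value} in {spec!r}")
    return name, value


print(parse_spec_sketch("fnr@90%"))  # ('fnr', 0.9)
print(parse_spec_sketch("fpr@0.8"))  # ('fpr', 0.8)
print(parse_spec_sketch("fpr"))      # ('fpr', 0.95)
```

Normalizing every threshold to a fraction at parse time means the metric functions only ever see one representation.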

---

## 7. Robust Error Handling

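Invalid metric specifications fail early with a `ValueError` and a clear message. In the example below, the message names the first invalid entry (`foo`); the out-of-range `fpr@2.0` is not mentioned.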

```python
# Both entries are meant to be invalid: an unknown name and an out-of-range threshold
try:
    evaluate_ood(in_scores, out_scores, metrics=["foo", "fpr@2.0"])
except ValueError as e:
    print(e)
```

```
Unknown metric 'foo'. Available: ['auroc', 'aupr'] + dynamic metric@value.
```


### Design principles

* Fail fast
* Explicit validation
* No silent fallbacks for unknown or invalid specifications

---

## 8. Best Practices

**Recommended**

* Use `"auroc"` for quick comparisons
* Use `"all"` for exploratory evaluation
* Use explicit thresholds (`fpr@95%`) in benchmarks

**Avoid**

* Hard-coded metric functions scattered across codebases
* Implicit thresholds without documentation
* Mixing multiple evaluation APIs

---

## 9. Summary

The unified OOD evaluation API in probly provides:

* A single, consistent entry point
* Backward compatibility with legacy workflows
* Easy extensibility for new metrics
* Clear, configuration-friendly semantics