Added evaluation setup #17

Merged

rkritika1508 merged 22 commits into main from feat/testing-setup on Jan 28, 2026

Conversation

@rkritika1508 (Collaborator) commented on Jan 19, 2026

Summary

Target issue is #7.

This PR introduces a structured, offline evaluation framework for Guardrails validators, with support for metrics generation, performance profiling, and artifact export. Validators such as the PII Remover and Lexical Slur Detection can now be benchmarked easily.

For lexical slur match, ban list, and gender assumption bias, evaluation adds little value because these validators are deterministic. However, we curated a dataset for lexical slur match for use in the toxicity detection validator planned for this quarter.

  1. Offline evaluation scripts
    Added standalone evaluation runners under app/evaluation/ for:
  • Lexical slur validator
  • PII remover validator
    Evaluations run validators directly (no API / Guard orchestration), ensuring deterministic and fast benchmarks.
  2. Metrics & profiling utilities
  • Binary classification metrics (tp, fp, fn, tn, precision, recall, F1).
  • Entity-level precision/recall/F1 for PII redaction (placeholder-based evaluation).
  • Lightweight profiling: per-sample latency (mean / p95 / max) and peak memory usage (via tracemalloc); see the profiler sketch after this list.
  3. Downloadable evaluation artifacts
    Each validator produces:
  • predictions.csv – row-level outputs for debugging and analysis
  • metrics.json – aggregated accuracy + performance metrics

Standardized output structure:

app/evaluation/outputs/
  lexical_slur/
    predictions.csv
    metrics.json
  pii_remover/
    predictions.csv
    metrics.json
  4. Testing
  • Evaluation scripts tested locally on existing lexical slur and PII datasets.
  • Outputs validated manually (predictions.csv, metrics.json) for correctness.
  • No changes to production runtime paths or APIs.
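
For context, a minimal sketch of what such a profiling helper could look like, assuming it wraps time.perf_counter and tracemalloc; the names (Profiler, record, latencies, peak_memory_mb) mirror the usage shown in the review comments below, but the actual utility under app/evaluation/common/ may be implemented differently.

```python
import time
import tracemalloc


class Profiler:
    """Illustrative per-sample profiler: wall-clock latency plus peak memory."""

    def __init__(self) -> None:
        self.latencies: list[float] = []  # per-sample latency in milliseconds
        self.peak_memory_mb: float = 0.0

    def __enter__(self) -> "Profiler":
        tracemalloc.start()
        return self

    def __exit__(self, *exc) -> bool:
        _, peak = tracemalloc.get_traced_memory()
        self.peak_memory_mb = peak / (1024 * 1024)
        tracemalloc.stop()
        return False  # never swallow exceptions

    def record(self, fn, *args, **kwargs):
        # Time a single validator call and keep the result.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies.append((time.perf_counter() - start) * 1000)
        return result
```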

How to test

  • Download the dataset from here. The download contains one folder per validator, and each folder contains a testing dataset in CSV format for that validator. Download these CSV files and store them in the backend/app/evaluation/datasets/ folder. Once the datasets are in place, we can run the evaluation script for each validator.

  • Lexical Slur Validator Evaluation
    Run the offline evaluation script: python app/evaluation/lexical_slur/run.py

Expected outputs:

app/evaluation/outputs/lexical_slur/
├── predictions.csv
└── metrics.json

predictions.csv contains row-level inputs, predictions, and labels.
metrics.json contains binary classification metrics and performance stats
(latency + peak memory).

  • PII Remover Evaluation
    Run the PII evaluation script: python app/evaluation/pii/run.py

Expected outputs:

app/evaluation/outputs/pii_remover/
├── predictions.csv
└── metrics.json

predictions.csv contains the original text, anonymized output, and ground-truth masked text.
metrics.json contains entity-level precision, recall, and F1 per PII type.

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, it is tested and has test cases.

Notes

Please add any other information the reviewer may need here.

Summary by CodeRabbit

  • New Features

    • Evaluation test runners for PII removal and lexical slur detection, producing predictions CSV and metrics JSON with per-entity metrics, precision/recall/F1, and sample counts.
    • Performance reporting including mean/p95/max latency and peak memory.
  • Documentation

    • Added a "Running evaluation tests" guide with workflow, expected outputs, and example commands.
  • Chores

    • Added scikit-learn dependency and ignored evaluation output files (predictions.csv, metrics.json).


@coderabbitai

This comment was marked as resolved.

@rkritika1508 rkritika1508 linked an issue Jan 27, 2026 that may be closed by this pull request
@rkritika1508 rkritika1508 marked this pull request as ready for review January 27, 2026 11:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@backend/app/eval/common/metrics.py`:
- Around line 1-5: compute_binary_metrics currently uses zip(y_true, y_pred)
which silently truncates on length mismatch; update each zip call in
compute_binary_metrics (the lines computing tp, tn, fp, fn) to use zip(y_true,
y_pred, strict=True) so a ValueError is raised if lengths differ, preserving
correct metric calculations.
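
As an illustration only, the strict zip could look like this, assuming y_true and y_pred are sequences of 0/1 labels (the real function in metrics.py may be shaped differently):

```python
def compute_binary_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    # strict=True (Python 3.10+) raises ValueError on a length mismatch
    # instead of silently truncating the longer sequence.
    pairs = list(zip(y_true, y_pred, strict=True))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}
```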

In `@backend/app/eval/lexical_slur/run.py`:
- Around line 35-40: The performance latency computation in run.py assumes
p.latencies is non-empty and will throw on empty datasets; update the
"performance" block to guard p.latencies (e.g., check if p.latencies truthy) and
only compute mean, p95 (sorted index), and max when there are values, otherwise
set those fields to a safe default such as None (or 0) so empty datasets don't
raise; locate the code using p.latencies in the "performance": {"latency_ms":
...} block and wrap or inline-conditional the mean, p95, and max calculations
accordingly.
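
One possible shape for the guard, assuming latencies are collected in milliseconds; latency_summary is a hypothetical helper, not existing code:

```python
def latency_summary(latencies: list[float]) -> dict:
    """Return mean/p95/max latency in ms, or None values for an empty dataset."""
    if not latencies:
        return {"mean": None, "p95": None, "max": None}
    lat = sorted(latencies)
    # Index-based p95 approximation over the sorted latencies.
    return {
        "mean": sum(lat) / len(lat),
        "p95": lat[int(0.95 * (len(lat) - 1))],
        "max": lat[-1],
    }


# The "performance" block in run.py could then become:
# "performance": {"latency_ms": latency_summary(p.latencies), "memory_mb": p.peak_memory_mb}
```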

In `@backend/app/eval/pii/entity_metrics.py`:
- Line 41: The loop pairing gold and predicted texts uses zip without strict
checking; update the loop "for gold_txt, pred_txt in zip(gold_texts,
pred_texts):" in entity_metrics.py to use strict=True (i.e., zip(gold_texts,
pred_texts, strict=True)) so a length mismatch raises immediately; ensure any
callers that rely on silent truncation are adjusted or tests updated if needed.

In `@backend/app/eval/pii/run.py`:
- Around line 29-31: The export currently writes raw text including potential
PII to predictions.csv; before calling write_csv(df, OUT_DIR /
"predictions.csv") remove or mask the source_text column (or any columns named
source_text, text, raw_text, etc.) from df, or gate inclusion behind an explicit
flag (e.g., a keep_raw_text boolean) that defaults to false; update the code
path that prepares df (the variable named df) so write_csv only receives non-PII
columns, and ensure the change is applied where write_csv and OUT_DIR /
"predictions.csv" are used to prevent accidental export of raw PII.

In `@backend/pyproject.toml`:
- Line 35: The unbounded "scikit-learn" entry in the pyproject.toml production
dependencies should be removed from the main dependencies list; either move the
"scikit-learn" line into dev-dependencies (or [project.optional-dependencies] if
used as an opt-in tool) or, if it truly must remain in production, replace the
unbounded string "scikit-learn" with a version-bounded specifier (e.g.,
"scikit-learn>=1.0,<2.0") to match the project's dependency pinning conventions;
update the dependencies section (the "scikit-learn" entry) accordingly.
🧹 Nitpick comments (1)
backend/scripts/test.sh (1)

6-6: Make the marker filter configurable for full-suite runs.

Hard-coding the exclusion prevents running the full suite from this script and can skew coverage. Consider an env override with a sensible default.

♻️ Proposed tweak
+MARK_EXPR=${PYTEST_MARK_EXPR:-"not slow and not redteam"}
-coverage run -m pytest app/tests -m "not slow and not redteam"
+coverage run -m pytest app/tests -m "$MARK_EXPR"

@nishika26

This comment was marked as resolved.

@rkritika1508 rkritika1508 changed the title Added testing setup Added evaluation setup Jan 28, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/app/evaluation/lexical_slur/run.py`:
- Around line 6-8: The imports at the top of run.py incorrectly reference
non-existent modules; replace them to import the utilities from helper.py by
importing compute_binary_metrics, Profiler, write_csv, and write_json from
app.evaluation.common.helper so any references to compute_binary_metrics,
Profiler, write_csv, and write_json in this file resolve correctly.
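
Assuming all four utilities do live in helper.py as described, the corrected import block would read:

```python
from app.evaluation.common.helper import (
    Profiler,
    compute_binary_metrics,
    write_csv,
    write_json,
)
```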

In `@backend/README.md`:
- Around line 106-115: The README has inconsistent output directory references:
some sections use "app/eval/outputs/" while the standardized structure and
scripts indicate "app/evaluation/outputs/"; update all occurrences of
"app/eval/outputs/" to "app/evaluation/outputs/" (including the lexical_slur and
pii_remover examples) so they match the BASE_DIR / "outputs" usage and the
standardized output structure shown earlier.
🧹 Nitpick comments (4)
backend/app/evaluation/pii/entity_metrics.py (1)

1-6: Use modern type hint syntax for Python 3.9+.

The typing.Iterable, typing.Dict, and typing.Set are deprecated in favor of collections.abc.Iterable and built-in dict/set generics.

♻️ Proposed modernization
 import re
 from collections import defaultdict
-from typing import Iterable, Dict, Set
+from collections.abc import Iterable

 # Matches placeholders like [PHONE_NUMBER], <IN_PAN>, etc.
 ENTITY_PATTERN = re.compile(r"[\[<]([A-Z0-9_]+)[\]>]")

Then update function signatures:

-def extract_entities(text: str) -> Set[str]:
+def extract_entities(text: str) -> set[str]:
-def compare_entities(gold: Set[str], pred: Set[str]):
+def compare_entities(gold: set[str], pred: set[str]):
-def compute_entity_metrics(
-    gold_texts: Iterable[str],
-    pred_texts: Iterable[str],
-) -> Dict[str, dict]:
+def compute_entity_metrics(
+    gold_texts: Iterable[str],
+    pred_texts: Iterable[str],
+) -> dict[str, dict]:
-def finalize_entity_metrics(stats: Dict[str, dict]) -> Dict[str, dict]:
+def finalize_entity_metrics(stats: dict[str, dict]) -> dict[str, dict]:
backend/README.md (1)

107-115: Add language specifiers to code blocks.

The fenced code blocks for directory structures should have a language specifier (e.g., text or plaintext) for better rendering and to satisfy markdown linting.

📝 Proposed fix
 Standardized output structure:
-```
+```text
 app/evaluation/outputs/
   lexical_slur/

Apply similarly to code blocks at lines 119 and 129.

backend/app/evaluation/pii/run.py (2)

16-20: Avoid calling private _validate method directly.

The run_pii function calls validator._validate(text) which is a private method. The lexical slur runner uses the public validator.validate() API. For consistency and to avoid breaking if the internal API changes, use the public method.

♻️ Proposed fix
 def run_pii(text: str) -> str:
-    result = validator._validate(text)
+    result = validator.validate(text, metadata=None)
     if isinstance(result, FailResult):
         return result.fix_value
     return text

32-39: Consider adding performance profiling for consistency.

The lexical slur runner includes latency and memory metrics, but the PII runner does not. For consistent benchmarking across validators, consider adding the same profiling.

♻️ Proposed addition
+from app.evaluation.common.profiling import Profiler
+
 validator = PIIRemover()

+with Profiler() as p:
+    df["anonymized"] = df["source_text"].astype(str).apply(
+        lambda x: p.record(run_pii, x)
+    )
-df["anonymized"] = df["source_text"].astype(str).apply(run_pii)

 # ... then in write_json:
 write_json(
     {
         "guardrail": "pii_remover",
         "num_samples": len(df),
         "entity_metrics": entity_report,
+        "performance": {
+            "latency_ms": {...},  # same pattern as lexical_slur
+            "memory_mb": p.peak_memory_mb,
+        },
     },
     OUT_DIR / "metrics.json",
 )

@rkritika1508 rkritika1508 merged commit 0f8604c into main Jan 28, 2026
1 check passed
@rkritika1508 rkritika1508 deleted the feat/testing-setup branch January 28, 2026 10:27

Development

Successfully merging this pull request may close these issues.

Validator: Add Validator Evaluation

3 participants