Conversation
* Validators: Added validator logs and organized code
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@backend/app/eval/common/metrics.py`:
- Around line 1-5: compute_binary_metrics currently uses zip(y_true, y_pred)
which silently truncates on length mismatch; update each zip call in
compute_binary_metrics (the lines computing tp, tn, fp, fn) to use zip(y_true,
y_pred, strict=True) so a ValueError is raised if lengths differ, preserving
correct metric calculations.
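For illustration, a minimal sketch of the adjusted function (the counting logic shown here is an assumption; only the zip calls are described above):
```python
def compute_binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, int]:
    # strict=True (Python 3.10+) raises ValueError on a length mismatch
    # instead of silently truncating to the shorter sequence.
    tp = sum(1 for t, p in zip(y_true, y_pred, strict=True) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred, strict=True) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred, strict=True) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred, strict=True) if t == 1 and p == 0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}
```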
In `@backend/app/eval/lexical_slur/run.py`:
- Around line 35-40: The performance latency computation in run.py assumes
p.latencies is non-empty and will throw on empty datasets; update the
"performance" block to guard p.latencies (e.g., check if p.latencies truthy) and
only compute mean, p95 (sorted index), and max when there are values, otherwise
set those fields to a safe default such as None (or 0) so empty datasets don't
raise; locate the code using p.latencies in the "performance": {"latency_ms":
...} block and wrap or inline-conditional the mean, p95, and max calculations
accordingly.
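One way the guarded block could look, assuming `p.latencies` is a list of per-call latencies in milliseconds (names follow the comment above, not the actual runner):
```python
# Guard against empty datasets before computing latency stats.
if p.latencies:
    lat = sorted(p.latencies)
    latency_ms = {
        "mean": sum(lat) / len(lat),
        "p95": lat[int(0.95 * (len(lat) - 1))],  # index into the sorted list
        "max": lat[-1],
    }
else:
    # Safe defaults so empty datasets don't raise.
    latency_ms = {"mean": None, "p95": None, "max": None}
```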
In `@backend/app/eval/pii/entity_metrics.py`:
- Line 41: The loop pairing gold and predicted texts uses zip without strict
checking; update the loop "for gold_txt, pred_txt in zip(gold_texts,
pred_texts):" in entity_metrics.py to use strict=True (i.e., zip(gold_texts,
pred_texts, strict=True)) so a length mismatch raises immediately; ensure any
callers that rely on silent truncation are adjusted or tests updated if needed.
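The change itself is a one-liner:
```python
# strict=True raises ValueError if gold_texts and pred_texts differ in length.
for gold_txt, pred_txt in zip(gold_texts, pred_texts, strict=True):
    ...
```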
In `@backend/app/eval/pii/run.py`:
- Around line 29-31: The export currently writes raw text including potential
PII to predictions.csv; before calling write_csv(df, OUT_DIR /
"predictions.csv") remove or mask the source_text column (or any columns named
source_text, text, raw_text, etc.) from df, or gate inclusion behind an explicit
flag (e.g., a keep_raw_text boolean) that defaults to false; update the code
path that prepares df (the variable named df) so write_csv only receives non-PII
columns, and ensure the change is applied where write_csv and OUT_DIR /
"predictions.csv" are used to prevent accidental export of raw PII.
In `@backend/pyproject.toml`:
- Line 35: The unbounded "scikit-learn" entry in the pyproject.toml production
dependencies should be removed from the main dependencies list; either move the
"scikit-learn" line into dev-dependencies (or [project.optional-dependencies] if
used as an opt-in tool) or, if it truly must remain in production, replace the
unbounded string "scikit-learn" with a version-bounded specifier (e.g.,
"scikit-learn>=1.0,<2.0") to match the project's dependency pinning conventions;
update the dependencies section (the "scikit-learn" entry) accordingly.
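The two options could look roughly like this in pyproject.toml (the version range and the `eval` extra name are illustrative):
```toml
[project]
# Option A: keep it in production, but version-bounded
dependencies = [
    "scikit-learn>=1.0,<2.0",
]

# Option B: move it behind an opt-in extra instead
[project.optional-dependencies]
eval = ["scikit-learn>=1.0,<2.0"]
```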
🧹 Nitpick comments (1)
backend/scripts/test.sh (1)
6-6: Make the marker filter configurable for full-suite runs.
Hard-coding the exclusion prevents running the full suite from this script and can skew coverage. Consider an env override with a sensible default.
♻️ Proposed tweak
```diff
+MARK_EXPR=${PYTEST_MARK_EXPR:-"not slow and not redteam"}
-coverage run -m pytest app/tests -m "not slow and not redteam"
+coverage run -m pytest app/tests -m "$MARK_EXPR"
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@backend/app/evaluation/lexical_slur/run.py`:
- Around line 6-8: The imports at the top of run.py incorrectly reference
non-existent modules; replace them to import the utilities from helper.py by
importing compute_binary_metrics, Profiler, write_csv, and write_json from
app.evaluation.common.helper so any references to compute_binary_metrics,
Profiler, write_csv, and write_json in this file resolve correctly.
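The corrected import would look along these lines (module path as named in the comment):
```python
from app.evaluation.common.helper import (
    Profiler,
    compute_binary_metrics,
    write_csv,
    write_json,
)
```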
In `@backend/README.md`:
- Around line 106-115: The README has inconsistent output directory references:
some sections use "app/eval/outputs/" while the standardized structure and
scripts indicate "app/evaluation/outputs/"; update all occurrences of
"app/eval/outputs/" to "app/evaluation/outputs/" (including the lexical_slur and
pii_remover examples) so they match the BASE_DIR / "outputs" usage and the
standardized output structure shown earlier.
🧹 Nitpick comments (4)
backend/app/evaluation/pii/entity_metrics.py (1)
1-6: Use modern type hint syntax for Python 3.9+.
The `typing.Iterable`, `typing.Dict`, and `typing.Set` are deprecated in favor of `collections.abc.Iterable` and built-in `dict`/`set` generics.
♻️ Proposed modernization
```diff
 import re
 from collections import defaultdict
-from typing import Iterable, Dict, Set
+from collections.abc import Iterable

 # Matches placeholders like [PHONE_NUMBER], <IN_PAN>, etc.
 ENTITY_PATTERN = re.compile(r"[\[<]([A-Z0-9_]+)[\]>]")
```
Then update function signatures:
```diff
-def extract_entities(text: str) -> Set[str]:
+def extract_entities(text: str) -> set[str]:

-def compare_entities(gold: Set[str], pred: Set[str]):
+def compare_entities(gold: set[str], pred: set[str]):

-def compute_entity_metrics(
-    gold_texts: Iterable[str],
-    pred_texts: Iterable[str],
-) -> Dict[str, dict]:
+def compute_entity_metrics(
+    gold_texts: Iterable[str],
+    pred_texts: Iterable[str],
+) -> dict[str, dict]:

-def finalize_entity_metrics(stats: Dict[str, dict]) -> Dict[str, dict]:
+def finalize_entity_metrics(stats: dict[str, dict]) -> dict[str, dict]:
```
backend/README.md (1)
107-115: Add language specifiers to code blocks.
The fenced code blocks for directory structures should have a language specifier (e.g., `text` or `plaintext`) for better rendering and to satisfy markdown linting.
📝 Proposed fix
````diff
 Standardized output structure:
-```
+```text
 app/evaluation/outputs/
   lexical_slur/
````
Apply similarly to code blocks at lines 119 and 129.
backend/app/evaluation/pii/run.py (2)
16-20: Avoid calling private `_validate` method directly.
The `run_pii` function calls `validator._validate(text)`, which is a private method. The lexical slur runner uses the public `validator.validate()` API. For consistency and to avoid breaking if the internal API changes, use the public method.
♻️ Proposed fix
```diff
 def run_pii(text: str) -> str:
-    result = validator._validate(text)
+    result = validator.validate(text, metadata=None)
     if isinstance(result, FailResult):
         return result.fix_value
     return text
```
32-39: Consider adding performance profiling for consistency.
The lexical slur runner includes latency and memory metrics, but the PII runner does not. For consistent benchmarking across validators, consider adding the same profiling.
♻️ Proposed addition
```diff
+from app.evaluation.common.profiling import Profiler
+
 validator = PIIRemover()

+with Profiler() as p:
+    df["anonymized"] = df["source_text"].astype(str).apply(
+        lambda x: p.record(run_pii, x)
+    )
-df["anonymized"] = df["source_text"].astype(str).apply(run_pii)

 # ... then in write_json:
 write_json(
     {
         "guardrail": "pii_remover",
         "num_samples": len(df),
         "entity_metrics": entity_report,
+        "performance": {
+            "latency_ms": {...},  # same pattern as lexical_slur
+            "memory_mb": p.peak_memory_mb,
+        },
     },
     OUT_DIR / "metrics.json",
 )
```
Summary
Target issue is #7.
This PR introduces a structured, offline evaluation framework for Guardrails validators, with support for metrics generation, performance profiling, and artifact export. We can benchmark validators like PII Remover and Lexical Slur Detection easily now.
For the lexical slur match, ban list, and gender assumption bias validators, evaluation doesn't add much value because they are deterministic. However, we curated a dataset for lexical slur match for use in the toxicity detection validator planned for this quarter.
Added standalone evaluation runners under app/evaluation/ for the PII Remover and Lexical Slur Detection validators.
Evaluations run validators directly (no API / Guard orchestration), ensuring deterministic and fast benchmarks.
Each validator produces a predictions.csv (row-level results) and a metrics.json (aggregate metrics).
Standardized output structure:
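Roughly as follows; the pii/ directory name is assumed from the runner's location, and the file names match the expected outputs listed under "How to test":
```text
app/evaluation/outputs/
  lexical_slur/
    predictions.csv
    metrics.json
  pii/
    predictions.csv
    metrics.json
```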
How to test
Download the dataset from here. It contains multiple folders, one for each validator, and each folder contains a testing dataset in CSV format for that validator. Download these CSV files and store them in the
backend/app/evaluation/datasets/ folder. Once the datasets have been stored, we can run the evaluation script for each validator.
Lexical Slur Validator Evaluation
Run the offline evaluation script:
```bash
python app/evaluation/lexical_slur/run.py
```
Expected outputs:
predictions.csv contains row-level inputs, predictions, and labels.
metrics.json contains binary classification metrics and performance stats
(latency + peak memory).
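For reference, the metrics.json payload has roughly this shape (key names pieced together from the runner snippets in the review; values zeroed out as placeholders):
```json
{
  "guardrail": "lexical_slur",
  "num_samples": 0,
  "metrics": {"tp": 0, "tn": 0, "fp": 0, "fn": 0},
  "performance": {
    "latency_ms": {"mean": 0.0, "p95": 0.0, "max": 0.0},
    "memory_mb": 0.0
  }
}
```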
Run the PII evaluation script:
```bash
python app/evaluation/pii/run.py
```
Expected outputs:
predictions.csv contains original text, anonymized output, ground-truth masked text
metrics.json contains entity-level precision, recall, and F1 per PII type.
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.