Added evaluation setup #17

Merged

rkritika1508 merged 22 commits into main from feat/testing-setup on Jan 28, 2026

Conversation

@rkritika1508 (Collaborator) commented on Jan 19, 2026

Summary

Target issue is #7.

This PR introduces a structured, offline evaluation framework for Guardrails validators, with support for metrics generation, performance profiling, and artifact export. Validators such as the PII Remover and Lexical Slur Detection can now be benchmarked easily.

For lexical slur match, ban list, and gender assumption bias, evaluation adds little value because these validators are deterministic. However, we curated a dataset for lexical slur match for use in the toxicity detection validator planned for this quarter.

  1. Offline evaluation scripts
    Added standalone evaluation runners under app/evaluation/ for:
  • Lexical slur validator
  • PII remover validator
    Evaluations run validators directly (no API / Guard orchestration), ensuring deterministic and fast benchmarks.
  2. Metrics & profiling utilities
  • Binary classification metrics (tp, fp, fn, tn, precision, recall, F1).
  • Entity-level precision/recall/F1 for PII redaction (placeholder-based evaluation).
  • Lightweight profiling: per-sample latency (mean / p95 / max) and peak memory usage (via tracemalloc); see the profiler sketch after this list.
  3. Downloadable evaluation artifacts
    Each validator produces:
  • predictions.csv – row-level outputs for debugging and analysis
  • metrics.json – aggregated accuracy + performance metrics

Standardized output structure:

app/evaluation/outputs/
  lexical_slur/
    predictions.csv
    metrics.json
  pii_remover/
    predictions.csv
    metrics.json
  4. Testing
  • Evaluation scripts tested locally on existing lexical slur and PII datasets.
  • Outputs validated manually (predictions.csv, metrics.json) for correctness.
  • No changes to production runtime paths or APIs.
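
For context, a minimal sketch of what such a profiling helper could look like, assuming it wraps time.perf_counter and tracemalloc; the names (Profiler, record, latencies, peak_memory_mb) mirror the usage shown in the review comments below, but the actual utility under app/evaluation/common/ may be implemented differently.

```python
import time
import tracemalloc


class Profiler:
    """Illustrative per-sample profiler: wall-clock latency plus peak memory."""

    def __init__(self) -> None:
        self.latencies: list[float] = []  # per-sample latency in milliseconds
        self.peak_memory_mb: float = 0.0

    def __enter__(self) -> "Profiler":
        tracemalloc.start()
        return self

    def __exit__(self, *exc) -> bool:
        _, peak = tracemalloc.get_traced_memory()
        self.peak_memory_mb = peak / (1024 * 1024)
        tracemalloc.stop()
        return False  # never swallow exceptions

    def record(self, fn, *args, **kwargs):
        # Time a single validator call and keep the result.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.latencies.append((time.perf_counter() - start) * 1000)
        return result
```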

How to test

  • Download the dataset from here. The download contains one folder per validator, and each folder contains a testing dataset in CSV format for that validator. Download these CSV files and store them in the backend/app/evaluation/datasets/ folder. Once the datasets are in place, we can run the evaluation script for each validator.

  • Lexical Slur Validator Evaluation
    Run the offline evaluation script: python app/evaluation/lexical_slur/run.py

Expected outputs:

app/evaluation/outputs/lexical_slur/
├── predictions.csv
└── metrics.json

predictions.csv contains row-level inputs, predictions, and labels.
metrics.json contains binary classification metrics and performance stats
(latency + peak memory).

  • PII Remover Evaluation
    Run the PII evaluation script: python app/evaluation/pii/run.py

Expected outputs:

app/evaluation/outputs/pii_remover/
├── predictions.csv
└── metrics.json

predictions.csv contains the original text, anonymized output, and ground-truth masked text.
metrics.json contains entity-level precision, recall, and F1 per PII type.

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, it is tested and has test cases.

Notes

Please add any other information the reviewer may need here.

Summary by CodeRabbit

  • New Features

    • Evaluation test runners for PII removal and lexical slur detection, producing predictions CSV and metrics JSON with per-entity metrics, precision/recall/F1, and sample counts.
    • Performance reporting including mean/p95/max latency and peak memory.
  • Documentation

    • Added a "Running evaluation tests" guide with workflow, expected outputs, and example commands.
  • Chores

    • Added scikit-learn dependency and ignored evaluation output files (predictions.csv, metrics.json).


@coderabbitai

This comment was marked as resolved.

@rkritika1508 rkritika1508 linked an issue Jan 27, 2026 that may be closed by this pull request
@rkritika1508 rkritika1508 marked this pull request as ready for review January 27, 2026 11:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@backend/app/eval/common/metrics.py`:
- Around line 1-5: compute_binary_metrics currently uses zip(y_true, y_pred)
which silently truncates on length mismatch; update each zip call in
compute_binary_metrics (the lines computing tp, tn, fp, fn) to use zip(y_true,
y_pred, strict=True) so a ValueError is raised if lengths differ, preserving
correct metric calculations.
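
As an illustration only, the strict zip could look like this, assuming y_true and y_pred are sequences of 0/1 labels (the real function in metrics.py may be shaped differently):

```python
def compute_binary_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    # strict=True (Python 3.10+) raises ValueError on a length mismatch
    # instead of silently truncating the longer sequence.
    pairs = list(zip(y_true, y_pred, strict=True))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}
```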

In `@backend/app/eval/lexical_slur/run.py`:
- Around line 35-40: The performance latency computation in run.py assumes
p.latencies is non-empty and will throw on empty datasets; update the
"performance" block to guard p.latencies (e.g., check if p.latencies truthy) and
only compute mean, p95 (sorted index), and max when there are values, otherwise
set those fields to a safe default such as None (or 0) so empty datasets don't
raise; locate the code using p.latencies in the "performance": {"latency_ms":
...} block and wrap or inline-conditional the mean, p95, and max calculations
accordingly.
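
One possible shape for the guard, assuming latencies are collected in milliseconds; latency_summary is a hypothetical helper, not existing code:

```python
def latency_summary(latencies: list[float]) -> dict:
    """Return mean/p95/max latency in ms, or None values for an empty dataset."""
    if not latencies:
        return {"mean": None, "p95": None, "max": None}
    lat = sorted(latencies)
    # Index-based p95 approximation over the sorted latencies.
    return {
        "mean": sum(lat) / len(lat),
        "p95": lat[int(0.95 * (len(lat) - 1))],
        "max": lat[-1],
    }


# The "performance" block in run.py could then become:
# "performance": {"latency_ms": latency_summary(p.latencies), "memory_mb": p.peak_memory_mb}
```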

In `@backend/app/eval/pii/entity_metrics.py`:
- Line 41: The loop pairing gold and predicted texts uses zip without strict
checking; update the loop "for gold_txt, pred_txt in zip(gold_texts,
pred_texts):" in entity_metrics.py to use strict=True (i.e., zip(gold_texts,
pred_texts, strict=True)) so a length mismatch raises immediately; ensure any
callers that rely on silent truncation are adjusted or tests updated if needed.

In `@backend/app/eval/pii/run.py`:
- Around line 29-31: The export currently writes raw text including potential
PII to predictions.csv; before calling write_csv(df, OUT_DIR /
"predictions.csv") remove or mask the source_text column (or any columns named
source_text, text, raw_text, etc.) from df, or gate inclusion behind an explicit
flag (e.g., a keep_raw_text boolean) that defaults to false; update the code
path that prepares df (the variable named df) so write_csv only receives non-PII
columns, and ensure the change is applied where write_csv and OUT_DIR /
"predictions.csv" are used to prevent accidental export of raw PII.

In `@backend/pyproject.toml`:
- Line 35: The unbounded "scikit-learn" entry in the pyproject.toml production
dependencies should be removed from the main dependencies list; either move the
"scikit-learn" line into dev-dependencies (or [project.optional-dependencies] if
used as an opt-in tool) or, if it truly must remain in production, replace the
unbounded string "scikit-learn" with a version-bounded specifier (e.g.,
"scikit-learn>=1.0,<2.0") to match the project's dependency pinning conventions;
update the dependencies section (the "scikit-learn" entry) accordingly.
🧹 Nitpick comments (1)
backend/scripts/test.sh (1)

6-6: Make the marker filter configurable for full-suite runs.

Hard-coding the exclusion prevents running the full suite from this script and can skew coverage. Consider an env override with a sensible default.

♻️ Proposed tweak
+MARK_EXPR=${PYTEST_MARK_EXPR:-"not slow and not redteam"}
-coverage run -m pytest app/tests -m "not slow and not redteam"
+coverage run -m pytest app/tests -m "$MARK_EXPR"

@nishika26

This comment was marked as resolved.

@rkritika1508 rkritika1508 changed the title Added testing setup Added evaluation setup Jan 28, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/app/evaluation/lexical_slur/run.py`:
- Around line 6-8: The imports at the top of run.py incorrectly reference
non-existent modules; replace them to import the utilities from helper.py by
importing compute_binary_metrics, Profiler, write_csv, and write_json from
app.evaluation.common.helper so any references to compute_binary_metrics,
Profiler, write_csv, and write_json in this file resolve correctly.
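
Assuming all four utilities do live in helper.py as described, the corrected import block would read:

```python
from app.evaluation.common.helper import (
    Profiler,
    compute_binary_metrics,
    write_csv,
    write_json,
)
```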

In `@backend/README.md`:
- Around line 106-115: The README has inconsistent output directory references:
some sections use "app/eval/outputs/" while the standardized structure and
scripts indicate "app/evaluation/outputs/"; update all occurrences of
"app/eval/outputs/" to "app/evaluation/outputs/" (including the lexical_slur and
pii_remover examples) so they match the BASE_DIR / "outputs" usage and the
standardized output structure shown earlier.
🧹 Nitpick comments (4)
backend/app/evaluation/pii/entity_metrics.py (1)

1-6: Use modern type hint syntax for Python 3.9+.

The typing.Iterable, typing.Dict, and typing.Set are deprecated in favor of collections.abc.Iterable and built-in dict/set generics.

♻️ Proposed modernization
 import re
 from collections import defaultdict
-from typing import Iterable, Dict, Set
+from collections.abc import Iterable

 # Matches placeholders like [PHONE_NUMBER], <IN_PAN>, etc.
 ENTITY_PATTERN = re.compile(r"[\[<]([A-Z0-9_]+)[\]>]")

Then update function signatures:

-def extract_entities(text: str) -> Set[str]:
+def extract_entities(text: str) -> set[str]:
-def compare_entities(gold: Set[str], pred: Set[str]):
+def compare_entities(gold: set[str], pred: set[str]):
-def compute_entity_metrics(
-    gold_texts: Iterable[str],
-    pred_texts: Iterable[str],
-) -> Dict[str, dict]:
+def compute_entity_metrics(
+    gold_texts: Iterable[str],
+    pred_texts: Iterable[str],
+) -> dict[str, dict]:
-def finalize_entity_metrics(stats: Dict[str, dict]) -> Dict[str, dict]:
+def finalize_entity_metrics(stats: dict[str, dict]) -> dict[str, dict]:
backend/README.md (1)

107-115: Add language specifiers to code blocks.

The fenced code blocks for directory structures should have a language specifier (e.g., text or plaintext) for better rendering and to satisfy markdown linting.

📝 Proposed fix
 Standardized output structure:
-```
+```text
 app/evaluation/outputs/
   lexical_slur/

Apply similarly to code blocks at lines 119 and 129.

backend/app/evaluation/pii/run.py (2)

16-20: Avoid calling private _validate method directly.

The run_pii function calls validator._validate(text) which is a private method. The lexical slur runner uses the public validator.validate() API. For consistency and to avoid breaking if the internal API changes, use the public method.

♻️ Proposed fix
 def run_pii(text: str) -> str:
-    result = validator._validate(text)
+    result = validator.validate(text, metadata=None)
     if isinstance(result, FailResult):
         return result.fix_value
     return text

32-39: Consider adding performance profiling for consistency.

The lexical slur runner includes latency and memory metrics, but the PII runner does not. For consistent benchmarking across validators, consider adding the same profiling.

♻️ Proposed addition
+from app.evaluation.common.profiling import Profiler
+
 validator = PIIRemover()

+with Profiler() as p:
+    df["anonymized"] = df["source_text"].astype(str).apply(
+        lambda x: p.record(run_pii, x)
+    )
-df["anonymized"] = df["source_text"].astype(str).apply(run_pii)

 # ... then in write_json:
 write_json(
     {
         "guardrail": "pii_remover",
         "num_samples": len(df),
         "entity_metrics": entity_report,
+        "performance": {
+            "latency_ms": {...},  # same pattern as lexical_slur
+            "memory_mb": p.peak_memory_mb,
+        },
     },
     OUT_DIR / "metrics.json",
 )

@rkritika1508 rkritika1508 merged commit 0f8604c into main Jan 28, 2026
1 check passed
@rkritika1508 rkritika1508 deleted the feat/testing-setup branch January 28, 2026 10:27

Development

Successfully merging this pull request may close these issues.

Validator: Add Validator Evaluation

3 participants