Added evaluation script for topic relevance #74
Conversation
📝 Walkthrough

Added a new topic-relevance evaluator, refactored the ban-list evaluator to use in-code evaluation configs and reusable runner functions, expanded .gitignore to broader evaluation output patterns, updated the top-level evaluation runner script, and added accuracy to binary metrics.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/tests/test_validate_with_guard.py (1)
6-9: ⚠️ Potential issue | 🔴 Critical: Missing import causes CI failure: `_resolve_topic_relevance_scope` is not imported.

The tests at lines 272-339 call `_resolve_topic_relevance_scope`, but this function is not included in the import statement. This causes the `NameError` failures in CI.

🐛 Proposed fix to add the missing import

```diff
 from app.api.routes.guardrails import (
     _resolve_validator_configs,
     _validate_with_guard,
+    _resolve_topic_relevance_scope,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/test_validate_with_guard.py` around lines 6 - 9, The test failure is caused by a missing import for _resolve_topic_relevance_scope used in tests; update the import list from app.api.routes.guardrails to include _resolve_topic_relevance_scope alongside _resolve_validator_configs and _validate_with_guard so the symbol is available to the tests, then run the test suite to confirm the NameError is resolved.
🧹 Nitpick comments (2)
backend/scripts/run_all_evaluations.sh (1)
19-26: Consider a preflight existence check for each runner path.

This gives a clearer error than a Python invocation failure when a path is wrong/moved.
Proposed refactor
```diff
 for runner in "${RUNNERS[@]}"; do
   name="$(basename "$(dirname "$runner")")"
+  if [[ ! -f "$runner" ]]; then
+    echo "Missing runner: $runner" >&2
+    exit 1
+  fi
+
   echo ""
   echo "==> [$name] $runner"
   uv run python "$runner"
 done
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/scripts/run_all_evaluations.sh` around lines 19 - 26, Add a preflight existence/readability check for each runner inside the for loop: before running uv run python on "$runner" (the RUNNERS array and the loop using variable runner and name), verify the path exists and is a regular file/readable and, if not, print a clear error referencing $runner and exit non‑zero (or skip based on desired behavior) so the script fails fast with a helpful message instead of relying on the Python invocation to report a missing file.

backend/app/evaluation/topic_relevance/run.py (1)
Verify each finding against the current code and only fix it if needed. In `@backend/scripts/run_all_evaluations.sh` around lines 19 - 26, Add a preflight existence/readability check for each runner inside the for loop: before running uv run python on "$runner" (the RUNNERS array and the loop using variable runner and name), verify the path exists and is a regular file/readable and, if not, print a clear error referencing $runner and exit non‑zero (or skip based on desired behavior) so the script fails fast with a helpful message instead of relying on the Python invocation to report a missing file.backend/app/evaluation/topic_relevance/run.py (1)
68-71: Consider simplifying the nested lambda for readability.The nested lambda structure is slightly convoluted:
lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)A local function would be clearer, though this is a minor readability improvement.
♻️ Suggested refactor for clarity
```diff
+def validate_input(text: str):
+    return validator.validate(text, metadata=None)
+
 with Profiler() as p:
-    results = normalized_df["input"].apply(
-        lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)
-    )
+    results = normalized_df["input"].apply(
+        lambda x: p.record(validate_input, x)
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 68 - 71, The nested lambda used in normalized_df["input"].apply is hard to read: currently it calls p.record with another lambda wrapping validator.validate. Replace the nested lambda with a small local helper function (e.g., validate_input or record_validation) that takes the input, calls p.record with validator.validate (passing metadata=None), and returns the result; then pass that helper to normalized_df["input"].apply so Profiler(), p.record, and validator.validate are used but the call site is clearer.
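As a sanity check of the suggested shape, here is a self-contained sketch; `Profiler`, `ToyValidator`, and `validate_input` are toy stand-ins for the real evaluation objects, not the project's actual API:

```python
import time

class Profiler:
    """Toy profiler: times each recorded call, mimicking p.record(fn, x)."""
    def __init__(self):
        self.latencies = []

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def record(self, fn, arg):
        start = time.perf_counter()
        result = fn(arg)
        self.latencies.append(time.perf_counter() - start)
        return result

class ToyValidator:
    """Stand-in for the TopicRelevance validator."""
    def validate(self, text, metadata=None):
        return {"input": text, "valid": bool(text.strip())}

validator = ToyValidator()

def validate_input(text):
    # One named helper instead of a nested lambda built per row.
    return validator.validate(text, metadata=None)

with Profiler() as p:
    results = [p.record(validate_input, t) for t in ["hello", "  "]]

print([r["valid"] for r in results])  # [True, False]
```

The same `validate_input` helper would drop straight into the `.apply(...)` call from the diff above.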
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Around line 40-65: After loading dataset (dataset = pd.read_csv(DATASET_PATH))
validate required columns up front: ensure either "label" exists or both
"source_text" and "target_text" exist; if neither condition holds raise a clear
ValueError describing the missing columns. Place this check before any metric
derivation (before computing results/redacted_text/y_pred or run_ban_list
profiling) so the script fails fast; reference dataset, DATASET_PATH, and the
run_ban_list/Profiler block when adding the validation.
- Around line 42-48: The BanList is being constructed without an explicit
on_fail mode so it defaults to noop and FailResult.fix_value remains unset;
update the BanList initialization in run_ban_list to pass the intended on_fail
behavior (e.g., the same mode used by BanListSafetyValidatorConfig such as
"redact" or the appropriate enum/value) so failures produce a populated
FailResult.fix_value, and ensure the downstream variable (redacted_text) will
contain the redacted value from result.fix_value when result is a FailResult in
the run_ban_list function.
In `@backend/app/tests/test_validate_with_guard.py`:
- Around line 319-339: The test name and assertions are misleading because
_resolve_topic_relevance_scope currently only checks topic_relevance_config_id,
not whether a validator has an inline configuration; update the test to reflect
actual behavior by renaming it and asserting that when topic_relevance_config_id
is absent on the GuardrailRequest the CRUD get is not called (and conversely add
a test that sets topic_relevance_config_id to ensure mock_get is called),
referencing _resolve_topic_relevance_scope, GuardrailRequest,
validator.configuration, and topic_relevance_crud.get; alternatively, if you
intended runtime inline configuration to suppress lookups, modify
_resolve_topic_relevance_scope to check validator.get("configuration") and
short-circuit when present, then update or remove the duplicate tests
accordingly.
---
Outside diff comments:
In `@backend/app/tests/test_validate_with_guard.py`:
- Around line 6-9: The test failure is caused by a missing import for
_resolve_topic_relevance_scope used in tests; update the import list from
app.api.routes.guardrails to include _resolve_topic_relevance_scope alongside
_resolve_validator_configs and _validate_with_guard so the symbol is available
to the tests, then run the test suite to confirm the NameError is resolved.
---
Nitpick comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 68-71: The nested lambda used in normalized_df["input"].apply is
hard to read: currently it calls p.record with another lambda wrapping
validator.validate. Replace the nested lambda with a small local helper function
(e.g., validate_input or record_validation) that takes the input, calls p.record
with validator.validate (passing metadata=None), and returns the result; then
pass that helper to normalized_df["input"].apply so Profiler(), p.record, and
validator.validate are used but the call site is clearer.
In `@backend/scripts/run_all_evaluations.sh`:
- Around line 19-26: Add a preflight existence/readability check for each runner
inside the for loop: before running uv run python on "$runner" (the RUNNERS
array and the loop using variable runner and name), verify the path exists and
is a regular file/readable and, if not, print a clear error referencing $runner
and exit non‑zero (or skip based on desired behavior) so the script fails fast
with a helpful message instead of relying on the Python invocation to report a
missing file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e7450187-141d-47cd-9aa6-ac2c7f81a2a2
📒 Files selected for processing (11)
- .gitignore
- backend/app/api/docs/topic_relevance_configs/create_topic_relevance_config.md
- backend/app/api/docs/topic_relevance_configs/delete_topic_relevance_config.md
- backend/app/api/docs/topic_relevance_configs/get_topic_relevance_config.md
- backend/app/api/docs/topic_relevance_configs/list_topic_relevance_configs.md
- backend/app/api/docs/topic_relevance_configs/update_topic_relevance_config.md
- backend/app/api/routes/guardrails.py
- backend/app/evaluation/ban_list/run.py
- backend/app/evaluation/topic_relevance/run.py
- backend/app/tests/test_validate_with_guard.py
- backend/scripts/run_all_evaluations.sh
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (2)
backend/app/evaluation/ban_list/run.py (2)
40-65: ⚠️ Potential issue | 🟡 Minor: Add fail-fast dataset schema validation before deriving metrics.
If required columns are absent, the script fails late (e.g., at Line 52 or Line 64) with less actionable errors. Validate required columns immediately after load.
Proposed fix
```diff
 dataset = pd.read_csv(DATASET_PATH)
+required = {"source_text"}
+missing_required = required - set(dataset.columns)
+if missing_required:
+    raise ValueError(f"Dataset missing required columns: {sorted(missing_required)}")
+if "label" not in dataset.columns and "target_text" not in dataset.columns:
+    raise ValueError("Dataset must include either 'label' or 'target_text'.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` around lines 40 - 65, After loading the CSV into dataset (DATASET_PATH), perform a fail-fast schema validation that checks for required columns before running run_ban_list/Profiler: ensure "source_text" always exists, and either "label" exists or both "target_text" exists (so the y_true fallback can run); if any required column is missing, raise a clear ValueError or log an error listing the missing columns and exit early so downstream calls to dataset["source_text"], dataset["target_text"], or dataset["label"] do not fail with obscure errors.
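The fail-fast check the prompt describes can be sketched independently of pandas by validating the column names up front; the column names and error wording mirror the proposed fix, but the helper itself is hypothetical:

```python
def check_dataset_columns(columns):
    """Fail fast if the evaluation dataset lacks required columns."""
    cols = set(columns)
    missing = {"source_text"} - cols
    if missing:
        raise ValueError(f"Dataset missing required columns: {sorted(missing)}")
    if "label" not in cols and "target_text" not in cols:
        raise ValueError("Dataset must include either 'label' or 'target_text'.")

check_dataset_columns(["source_text", "label"])        # OK
check_dataset_columns(["source_text", "target_text"])  # OK

try:
    check_dataset_columns(["source_text"])
except ValueError as err:
    print(err)  # Dataset must include either 'label' or 'target_text'.
```

With pandas, the same helper would be called as `check_dataset_columns(dataset.columns)` immediately after `pd.read_csv(DATASET_PATH)`.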
42-48: ⚠️ Potential issue | 🟠 Major: Make `BanList` `on_fail` explicit to match redaction intent.

At Line 42, the default `on_fail` mode may leave `FailResult.fix_value` unset, so `redacted_text` at Line 57 can end up being the original text. Set an explicit mode aligned with intended behavior.

Proposed fix

```diff
-    validator = BanList(banned_words=banned_words)
+    validator = BanList(banned_words=banned_words, on_fail="fix")
```

For the current guardrails.hub BanList validator, what are the default `on_fail` behavior and the valid `on_fail` modes that populate `FailResult.fix_value`?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` around lines 42 - 48, The BanList validator is created without an explicit on_fail mode so FailResult.fix_value may be unset; change the BanList instantiation in run_ban_list to include an explicit redaction mode (e.g., set on_fail="redact") so that validator.validate(...) returns a FailResult with fix_value populated and redacted_text uses that value; update the BanList(...) call (and any related tests) to explicitly pass on_fail="redact" to ensure predictable redaction behavior.
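To illustrate why the explicit mode matters, here is a minimal toy reimplementation of the pattern; `ToyBanList` and its modes are hypothetical stand-ins, not the real guardrails.hub `BanList` API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailResult:
    error_message: str
    fix_value: Optional[str] = None  # populated only when a fix mode is requested

class ToyBanList:
    """Hypothetical validator mirroring the on_fail behavior discussed above."""
    def __init__(self, banned_words, on_fail="noop"):
        self.banned_words = banned_words
        self.on_fail = on_fail

    def validate(self, text):
        hits = [w for w in self.banned_words if w in text]
        if not hits:
            return None  # pass
        fix = None
        if self.on_fail == "fix":
            fix = text
            for w in hits:
                fix = fix.replace(w, "[REDACTED]")
        return FailResult(error_message=f"banned words: {hits}", fix_value=fix)

default_result = ToyBanList(["secret"]).validate("a secret plan")
fix_result = ToyBanList(["secret"], on_fail="fix").validate("a secret plan")
print(default_result.fix_value)  # None
print(fix_result.fix_value)     # a [REDACTED] plan
```

With the default mode the failure carries no replacement text, which is exactly why `redacted_text` downstream can silently stay equal to the original input.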
🧹 Nitpick comments (1)
.gitignore (1)
12-13: Narrow dataset ignore patterns to avoid hiding source-of-truth assets.

Ignoring `backend/app/evaluation/datasets/**/*.csv` and `**/*.txt` can silently block new/updated evaluation datasets and topic configs from being committed. That can break reproducibility for other contributors.

Proposed change

```diff
-backend/app/evaluation/datasets/**/*.csv
-backend/app/evaluation/datasets/**/*.txt
+# Keep canonical evaluation datasets/configs tracked.
+# Ignore only generated artifacts under outputs.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 12 - 13, The .gitignore currently hides all CSV and TXT files under backend/app/evaluation/datasets/** which can prevent source-of-truth evaluation datasets and topic configs from being committed; update the ignore rules to only exclude truly generated or transient files (e.g., files under a generated/ or outputs/ subdirectory) and remove or scope the broad patterns backend/app/evaluation/datasets/**/*.csv and backend/app/evaluation/datasets/**/*.txt so that canonical dataset files are tracked; specifically, replace the global patterns with more targeted patterns (or add explicit negative exceptions) so functions/processes that rely on files in backend/app/evaluation/datasets (the dataset/topic config source) aren’t accidentally hidden.
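One possible shape for the narrowed rules; the `outputs/` and `canonical/` directory names are assumptions about this repo's layout, not paths confirmed by the diff:

```gitignore
# Track canonical datasets/configs; ignore only generated artifacts.
backend/app/evaluation/**/outputs/

# Or, if the broad patterns must stay, re-include specific tracked files:
# backend/app/evaluation/datasets/**/*.csv
# !backend/app/evaluation/datasets/canonical/*.csv
```

Note that `!` negation only re-includes a file if no parent directory of it is itself ignored.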
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 50-63: Before creating normalized_df, validate that df contains
required columns ("input", "category", "scope") and that every value in
df["scope"] (after .astype(str).str.strip().upper()) is one of the allowed
labels (e.g., "IN_SCOPE" or "OUT_OF_SCOPE"); on any missing column or invalid
scope value, raise a clear Exception referencing dataset_path so the run fails
fast. In the TopicRelevance flow, normalize scope strings via
.str.strip().str.upper() and map "IN_SCOPE" -> 1 and "OUT_OF_SCOPE" -> 0 only
after validation (do not default other values to 0), and update the creation of
normalized_df accordingly so y_true cannot be silently corrupted.
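The strict mapping this prompt asks for can be sketched without pandas; the helper name and error message are illustrative, but the normalization (`strip` + `upper`) and label map follow the prompt:

```python
SCOPE_TO_LABEL = {"IN_SCOPE": 1, "OUT_OF_SCOPE": 0}

def map_scope_labels(raw_scopes, dataset_path="<dataset>"):
    """Normalize scope strings and map them to y_true, failing on unknowns."""
    y_true = []
    for raw in raw_scopes:
        scope = str(raw).strip().upper()
        if scope not in SCOPE_TO_LABEL:
            raise ValueError(
                f"Invalid scope {raw!r} in {dataset_path}; "
                f"expected one of {sorted(SCOPE_TO_LABEL)}"
            )
        y_true.append(SCOPE_TO_LABEL[scope])
    return y_true

print(map_scope_labels([" in_scope ", "OUT_OF_SCOPE"]))  # [1, 0]
```

Raising on unknown values, rather than defaulting them to 0, is the point: a typo in the dataset fails the run instead of corrupting `y_true`.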
---
Duplicate comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Around line 40-65: After loading the CSV into dataset (DATASET_PATH), perform
a fail-fast schema validation that checks for required columns before running
run_ban_list/Profiler: ensure "source_text" always exists, and either "label"
exists or both "target_text" exists (so the y_true fallback can run); if any
required column is missing, raise a clear ValueError or log an error listing the
missing columns and exit early so downstream calls to dataset["source_text"],
dataset["target_text"], or dataset["label"] do not fail with obscure errors.
- Around line 42-48: The BanList validator is created without an explicit
on_fail mode so FailResult.fix_value may be unset; change the BanList
instantiation in run_ban_list to include an explicit redaction mode (e.g., set
on_fail="redact") so that validator.validate(...) returns a FailResult with
fix_value populated and redacted_text uses that value; update the BanList(...)
call (and any related tests) to explicitly pass on_fail="redact" to ensure
predictable redaction behavior.
---
Nitpick comments:
In @.gitignore:
- Around line 12-13: The .gitignore currently hides all CSV and TXT files under
backend/app/evaluation/datasets/** which can prevent source-of-truth evaluation
datasets and topic configs from being committed; update the ignore rules to only
exclude truly generated or transient files (e.g., files under a generated/ or
outputs/ subdirectory) and remove or scope the broad patterns
backend/app/evaluation/datasets/**/*.csv and
backend/app/evaluation/datasets/**/*.txt so that canonical dataset files are
tracked; specifically, replace the global patterns with more targeted patterns
(or add explicit negative exceptions) so functions/processes that rely on files
in backend/app/evaluation/datasets (the dataset/topic config source) aren’t
accidentally hidden.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 47a4c401-b8a4-4a49-9903-ebf13940fa63
📒 Files selected for processing (4)
- .gitignore
- backend/app/evaluation/ban_list/run.py
- backend/app/evaluation/topic_relevance/run.py
- backend/scripts/run_all_evaluations.sh
I feel like we should make a separate readme for evaluation scripts. Since making a dataset for each evaluation script depends on the validator and is turning out to be different each time, there should be a separate md in the evaluation folder that contains everything about evaluation.
🧹 Nitpick comments (2)
backend/app/evaluation/topic_relevance/run.py (2)
102-113: Minor inconsistency: use `dataset_path.name` for consistency with ban_list.

Line 107 uses `str(dataset_path)` (full path), while `ban_list/run.py` uses `str(DATASET_PATH.name)` (filename only). Consider aligning for a consistent report format.

Proposed fix

```diff
-        dataset=str(dataset_path),
+        dataset=str(dataset_path.name),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 102 - 113, The report currently passes the full dataset path via str(dataset_path) into build_evaluation_report; change this to use dataset_path.name so the report uses the filename (consistent with ban_list's use of DATASET_PATH.name). Update the call site in the block that calls write_json(build_evaluation_report(..., dataset=str(dataset_path), ...), OUT_DIR / f"{domain}-metrics.json") to pass dataset=dataset_path.name instead, leaving other parameters (guardrail, num_samples, profiler p, llm_callable, prompt_schema_version, metrics) and the write_json/OUT_DIR usage unchanged.
73-76: Consider simplifying the nested lambda.

The nested lambda `lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)` creates a new inner function object per row. This can be simplified by defining the validation function once.

Proposed refactor

```diff
+def validate_input(text: str):
+    return validator.validate(text, metadata=None)
+
 with Profiler() as p:
     results = normalized_df["input"].apply(
-        lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)
+        lambda x: p.record(validate_input, x)
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 73 - 76, The apply creates a new inner lambda per row; instead define the validation callable once and pass it into p.record to avoid recreating functions repeatedly. In run.py, create a single validation function (e.g., with functools.partial or a named lambda) that calls validator.validate(..., metadata=None) and then call normalized_df["input"].apply(lambda x: p.record(validation_fn, x)) (or equivalent) so Profiler.p.record receives the pre-defined callable instead of constructing a new inner lambda for every row.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 102-113: The report currently passes the full dataset path via
str(dataset_path) into build_evaluation_report; change this to use
dataset_path.name so the report uses the filename (consistent with ban_list's
use of DATASET_PATH.name). Update the call site in the block that calls
write_json(build_evaluation_report(..., dataset=str(dataset_path), ...), OUT_DIR
/ f"{domain}-metrics.json") to pass dataset=dataset_path.name instead, leaving
other parameters (guardrail, num_samples, profiler p, llm_callable,
prompt_schema_version, metrics) and the write_json/OUT_DIR usage unchanged.
- Around line 73-76: The apply creates a new inner lambda per row; instead
define the validation callable once and pass it into p.record to avoid
recreating functions repeatedly. In run.py, create a single validation function
(e.g., with functools.partial or a named lambda) that calls
validator.validate(..., metadata=None) and then call
normalized_df["input"].apply(lambda x: p.record(validation_fn, x)) (or
equivalent) so Profiler.p.record receives the pre-defined callable instead of
constructing a new inner lambda for every row.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ae72bfa2-b27a-40f9-a585-45bdcd7fba66
📒 Files selected for processing (3)
- backend/app/evaluation/ban_list/run.py
- backend/app/evaluation/common/helper.py
- backend/app/evaluation/topic_relevance/run.py
🧹 Nitpick comments (2)
backend/app/evaluation/topic_relevance/run.py (2)
51-57: Add file existence checks before reading.

If the dataset or topic config files are missing, the current code will raise less informative errors. Consider validating file existence upfront.
Proposed improvement
```diff
     dataset_path = DATASETS_DIR / config["dataset"]
     topic_config_path = DATASETS_DIR / config["topic_config"]
+
+    if not dataset_path.exists():
+        raise FileNotFoundError(f"Dataset not found: {dataset_path}")
+    if not topic_config_path.exists():
+        raise FileNotFoundError(f"Topic config not found: {topic_config_path}")
+
     topic_config = topic_config_path.read_text()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 51 - 57, Before reading files, validate that dataset_path and topic_config_path exist (constructed from DATASETS_DIR and config["dataset"]/config["topic_config"]) and raise or log a clear error if missing; replace direct calls to topic_config_path.read_text() and pd.read_csv(dataset_path) with existence checks (e.g., if not dataset_path.exists(): raise FileNotFoundError(f"Dataset not found: {dataset_path}")) so the topic_config and dataframe loading in run.py produce informative messages including the domain and the missing path.
55-55: Consider using `logging` instead of `print`.

Static analysis flagged the `print` calls; `logging` would provide better control over verbosity and align with production code standards.

Also applies to: 117-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` at line 55, Replace the module-level print calls (e.g., the print(f"\nRunning topic relevance evaluation: {domain}") and the other print at the later location) with standard logging calls: obtain a logger via logging.getLogger(__name__) at the top of backend/app/evaluation/topic_relevance/run.py, then use logger.info or logger.debug for progress messages and logger.error for errors; ensure the logger is configured (or rely on application logging configuration) instead of using print so verbosity can be controlled consistently.
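A minimal sketch of the print-to-logging swap the prompt describes; the function name and message are illustrative, only the module-level logger pattern is the point:

```python
import logging

logger = logging.getLogger(__name__)

def run_domain_evaluation(domain: str) -> None:
    # Replaces: print(f"\nRunning topic relevance evaluation: {domain}")
    logger.info("Running topic relevance evaluation: %s", domain)

if __name__ == "__main__":
    # Script entry points can configure verbosity in one place.
    logging.basicConfig(level=logging.INFO)
    run_domain_evaluation("example-domain")
```

Library code should only call `getLogger` and emit records; `basicConfig` (or any handler setup) belongs at the entry point so callers control verbosity.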
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 51-57: Before reading files, validate that dataset_path and
topic_config_path exist (constructed from DATASETS_DIR and
config["dataset"]/config["topic_config"]) and raise or log a clear error if
missing; replace direct calls to topic_config_path.read_text() and
pd.read_csv(dataset_path) with existence checks (e.g., if not
dataset_path.exists(): raise FileNotFoundError(f"Dataset not found:
{dataset_path}")) so the topic_config and dataframe loading in run.py produce
informative messages including the domain and the missing path.
- Line 55: Replace the module-level print calls (e.g., the print(f"\nRunning
topic relevance evaluation: {domain}") and the other print at the later
location) with standard logging calls: obtain a logger via
logging.getLogger(__name__) at the top of
backend/app/evaluation/topic_relevance/run.py, then use logger.info or
logger.debug for progress messages and logger.error for errors; ensure the
logger is configured (or rely on application logging configuration) instead of
using print so verbosity can be controlled consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a8be8840-34f7-4b83-b854-ad1fe97f02a8
📒 Files selected for processing (1)
backend/app/evaluation/topic_relevance/run.py
Summary
Target issue is #75.
Explain the motivation for making this change. What existing problem does the pull request solve?
This PR contains the following changes -
`backend/app/evaluation/topic_relevance/run.py` has some pre-configured topic configurations and datasets that we will be running to evaluate the validator performance. All testing datasets are present here: https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89
Checklist
Before submitting a pull request, please ensure that you mark these tasks.

Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
New Features
Improvements
Chores