Added evaluation script for topic relevance #74
Conversation
📝 Walkthrough

Added a new topic-relevance evaluator, refactored the ban-list evaluator to use in-code evaluation configs and reusable runner functions, expanded .gitignore to broader evaluation output patterns, updated the top-level evaluation runner script, and added accuracy to binary metrics.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/tests/test_validate_with_guard.py (1)
6-9: ⚠️ Potential issue | 🔴 Critical: Missing import causes CI failure: `_resolve_topic_relevance_scope` is not imported.

The tests at lines 272-339 call `_resolve_topic_relevance_scope`, but this function is not included in the import statement. This causes the `NameError` failures in CI.

🐛 Proposed fix to add the missing import

```diff
 from app.api.routes.guardrails import (
     _resolve_validator_configs,
     _validate_with_guard,
+    _resolve_topic_relevance_scope,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/test_validate_with_guard.py` around lines 6 - 9, The test failure is caused by a missing import for _resolve_topic_relevance_scope used in tests; update the import list from app.api.routes.guardrails to include _resolve_topic_relevance_scope alongside _resolve_validator_configs and _validate_with_guard so the symbol is available to the tests, then run the test suite to confirm the NameError is resolved.
🧹 Nitpick comments (2)
backend/scripts/run_all_evaluations.sh (1)
19-26: Consider a preflight existence check for each runner path.

This gives a clearer error than a Python invocation failure when a path is wrong/moved.
Proposed refactor
```diff
 for runner in "${RUNNERS[@]}"; do
   name="$(basename "$(dirname "$runner")")"
+  if [[ ! -f "$runner" ]]; then
+    echo "Missing runner: $runner" >&2
+    exit 1
+  fi
+
   echo ""
   echo "==> [$name] $runner"
   uv run python "$runner"
 done
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/scripts/run_all_evaluations.sh` around lines 19 - 26, Add a preflight existence/readability check for each runner inside the for loop: before running uv run python on "$runner" (the RUNNERS array and the loop using variable runner and name), verify the path exists and is a regular file/readable and, if not, print a clear error referencing $runner and exit non‑zero (or skip based on desired behavior) so the script fails fast with a helpful message instead of relying on the Python invocation to report a missing file.

backend/app/evaluation/topic_relevance/run.py (1)
Verify each finding against the current code and only fix it if needed. In `@backend/scripts/run_all_evaluations.sh` around lines 19 - 26, Add a preflight existence/readability check for each runner inside the for loop: before running uv run python on "$runner" (the RUNNERS array and the loop using variable runner and name), verify the path exists and is a regular file/readable and, if not, print a clear error referencing $runner and exit non‑zero (or skip based on desired behavior) so the script fails fast with a helpful message instead of relying on the Python invocation to report a missing file.backend/app/evaluation/topic_relevance/run.py (1)
68-71: Consider simplifying the nested lambda for readability.The nested lambda structure is slightly convoluted:
lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)A local function would be clearer, though this is a minor readability improvement.
♻️ Suggested refactor for clarity
```diff
+def validate_input(text: str):
+    return validator.validate(text, metadata=None)
+
 with Profiler() as p:
-    results = normalized_df["input"].apply(
-        lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)
-    )
+    results = normalized_df["input"].apply(
+        lambda x: p.record(validate_input, x)
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 68 - 71, The nested lambda used in normalized_df["input"].apply is hard to read: currently it calls p.record with another lambda wrapping validator.validate. Replace the nested lambda with a small local helper function (e.g., validate_input or record_validation) that takes the input, calls p.record with validator.validate (passing metadata=None), and returns the result; then pass that helper to normalized_df["input"].apply so Profiler(), p.record, and validator.validate are used but the call site is clearer.
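As a sanity check of the suggested shape, here is a self-contained sketch; `Profiler`, `ToyValidator`, and `validate_input` are toy stand-ins for the real evaluation objects, not the project's actual API:

```python
import time

class Profiler:
    """Toy profiler: times each recorded call, mimicking p.record(fn, x)."""
    def __init__(self):
        self.latencies = []

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def record(self, fn, arg):
        start = time.perf_counter()
        result = fn(arg)
        self.latencies.append(time.perf_counter() - start)
        return result

class ToyValidator:
    """Stand-in for the TopicRelevance validator."""
    def validate(self, text, metadata=None):
        return {"input": text, "valid": bool(text.strip())}

validator = ToyValidator()

def validate_input(text):
    # One named helper instead of a nested lambda built per row.
    return validator.validate(text, metadata=None)

with Profiler() as p:
    results = [p.record(validate_input, t) for t in ["hello", "  "]]

print([r["valid"] for r in results])  # [True, False]
```

The same `validate_input` helper would drop straight into the `.apply(...)` call from the diff above.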
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Around line 40-65: After loading dataset (dataset = pd.read_csv(DATASET_PATH))
validate required columns up front: ensure either "label" exists or both
"source_text" and "target_text" exist; if neither condition holds raise a clear
ValueError describing the missing columns. Place this check before any metric
derivation (before computing results/redacted_text/y_pred or run_ban_list
profiling) so the script fails fast; reference dataset, DATASET_PATH, and the
run_ban_list/Profiler block when adding the validation.
- Around line 42-48: The BanList is being constructed without an explicit
on_fail mode so it defaults to noop and FailResult.fix_value remains unset;
update the BanList initialization in run_ban_list to pass the intended on_fail
behavior (e.g., the same mode used by BanListSafetyValidatorConfig such as
"redact" or the appropriate enum/value) so failures produce a populated
FailResult.fix_value, and ensure the downstream variable (redacted_text) will
contain the redacted value from result.fix_value when result is a FailResult in
the run_ban_list function.
In `@backend/app/tests/test_validate_with_guard.py`:
- Around line 319-339: The test name and assertions are misleading because
_resolve_topic_relevance_scope currently only checks topic_relevance_config_id,
not whether a validator has an inline configuration; update the test to reflect
actual behavior by renaming it and asserting that when topic_relevance_config_id
is absent on the GuardrailRequest the CRUD get is not called (and conversely add
a test that sets topic_relevance_config_id to ensure mock_get is called),
referencing _resolve_topic_relevance_scope, GuardrailRequest,
validator.configuration, and topic_relevance_crud.get; alternatively, if you
intended runtime inline configuration to suppress lookups, modify
_resolve_topic_relevance_scope to check validator.get("configuration") and
short-circuit when present, then update or remove the duplicate tests
accordingly.
---
Outside diff comments:
In `@backend/app/tests/test_validate_with_guard.py`:
- Around line 6-9: The test failure is caused by a missing import for
_resolve_topic_relevance_scope used in tests; update the import list from
app.api.routes.guardrails to include _resolve_topic_relevance_scope alongside
_resolve_validator_configs and _validate_with_guard so the symbol is available
to the tests, then run the test suite to confirm the NameError is resolved.
---
Nitpick comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 68-71: The nested lambda used in normalized_df["input"].apply is
hard to read: currently it calls p.record with another lambda wrapping
validator.validate. Replace the nested lambda with a small local helper function
(e.g., validate_input or record_validation) that takes the input, calls p.record
with validator.validate (passing metadata=None), and returns the result; then
pass that helper to normalized_df["input"].apply so Profiler(), p.record, and
validator.validate are used but the call site is clearer.
In `@backend/scripts/run_all_evaluations.sh`:
- Around line 19-26: Add a preflight existence/readability check for each runner
inside the for loop: before running uv run python on "$runner" (the RUNNERS
array and the loop using variable runner and name), verify the path exists and
is a regular file/readable and, if not, print a clear error referencing $runner
and exit non‑zero (or skip based on desired behavior) so the script fails fast
with a helpful message instead of relying on the Python invocation to report a
missing file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e7450187-141d-47cd-9aa6-ac2c7f81a2a2
📒 Files selected for processing (11)
- .gitignore
- backend/app/api/docs/topic_relevance_configs/create_topic_relevance_config.md
- backend/app/api/docs/topic_relevance_configs/delete_topic_relevance_config.md
- backend/app/api/docs/topic_relevance_configs/get_topic_relevance_config.md
- backend/app/api/docs/topic_relevance_configs/list_topic_relevance_configs.md
- backend/app/api/docs/topic_relevance_configs/update_topic_relevance_config.md
- backend/app/api/routes/guardrails.py
- backend/app/evaluation/ban_list/run.py
- backend/app/evaluation/topic_relevance/run.py
- backend/app/tests/test_validate_with_guard.py
- backend/scripts/run_all_evaluations.sh
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 1
♻️ Duplicate comments (2)
backend/app/evaluation/ban_list/run.py (2)
40-65: ⚠️ Potential issue | 🟡 Minor: Add fail-fast dataset schema validation before deriving metrics.
If required columns are absent, the script fails late (e.g., at Line 52 or Line 64) with less actionable errors. Validate required columns immediately after load.
Proposed fix
```diff
 dataset = pd.read_csv(DATASET_PATH)
+required = {"source_text"}
+missing_required = required - set(dataset.columns)
+if missing_required:
+    raise ValueError(f"Dataset missing required columns: {sorted(missing_required)}")
+if "label" not in dataset.columns and "target_text" not in dataset.columns:
+    raise ValueError("Dataset must include either 'label' or 'target_text'.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` around lines 40 - 65, After loading the CSV into dataset (DATASET_PATH), perform a fail-fast schema validation that checks for required columns before running run_ban_list/Profiler: ensure "source_text" always exists, and either "label" exists or both "target_text" exists (so the y_true fallback can run); if any required column is missing, raise a clear ValueError or log an error listing the missing columns and exit early so downstream calls to dataset["source_text"], dataset["target_text"], or dataset["label"] do not fail with obscure errors.
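The fail-fast check the prompt describes can be sketched independently of pandas by validating the column names up front; the column names and error wording mirror the proposed fix, but the helper itself is hypothetical:

```python
def check_dataset_columns(columns):
    """Fail fast if the evaluation dataset lacks required columns."""
    cols = set(columns)
    missing = {"source_text"} - cols
    if missing:
        raise ValueError(f"Dataset missing required columns: {sorted(missing)}")
    if "label" not in cols and "target_text" not in cols:
        raise ValueError("Dataset must include either 'label' or 'target_text'.")

check_dataset_columns(["source_text", "label"])        # OK
check_dataset_columns(["source_text", "target_text"])  # OK

try:
    check_dataset_columns(["source_text"])
except ValueError as err:
    print(err)  # Dataset must include either 'label' or 'target_text'.
```

With pandas, the same helper would be called as `check_dataset_columns(dataset.columns)` immediately after `pd.read_csv(DATASET_PATH)`.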
42-48: ⚠️ Potential issue | 🟠 Major: Make `BanList` `on_fail` explicit to match redaction intent.

At Line 42, the default `on_fail` mode may leave `FailResult.fix_value` unset, so `redacted_text` at Line 57 can end up being the original text. Set an explicit mode aligned with intended behavior.

Proposed fix

```diff
-    validator = BanList(banned_words=banned_words)
+    validator = BanList(banned_words=banned_words, on_fail="fix")
```

For the current guardrails.hub BanList validator, what are the default `on_fail` behavior and the valid `on_fail` modes that populate `FailResult.fix_value`?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` around lines 42 - 48, The BanList validator is created without an explicit on_fail mode so FailResult.fix_value may be unset; change the BanList instantiation in run_ban_list to include an explicit redaction mode (e.g., set on_fail="redact") so that validator.validate(...) returns a FailResult with fix_value populated and redacted_text uses that value; update the BanList(...) call (and any related tests) to explicitly pass on_fail="redact" to ensure predictable redaction behavior.
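To illustrate why the explicit mode matters, here is a minimal toy reimplementation of the pattern; `ToyBanList` and its modes are hypothetical stand-ins, not the real guardrails.hub `BanList` API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailResult:
    error_message: str
    fix_value: Optional[str] = None  # populated only when a fix mode is requested

class ToyBanList:
    """Hypothetical validator mirroring the on_fail behavior discussed above."""
    def __init__(self, banned_words, on_fail="noop"):
        self.banned_words = banned_words
        self.on_fail = on_fail

    def validate(self, text):
        hits = [w for w in self.banned_words if w in text]
        if not hits:
            return None  # pass
        fix = None
        if self.on_fail == "fix":
            fix = text
            for w in hits:
                fix = fix.replace(w, "[REDACTED]")
        return FailResult(error_message=f"banned words: {hits}", fix_value=fix)

default_result = ToyBanList(["secret"]).validate("a secret plan")
fix_result = ToyBanList(["secret"], on_fail="fix").validate("a secret plan")
print(default_result.fix_value)  # None
print(fix_result.fix_value)     # a [REDACTED] plan
```

With the default mode the failure carries no replacement text, which is exactly why `redacted_text` downstream can silently stay equal to the original input.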
🧹 Nitpick comments (1)
.gitignore (1)
12-13: Narrow dataset ignore patterns to avoid hiding source-of-truth assets.

Ignoring `backend/app/evaluation/datasets/**/*.csv` and `**/*.txt` can silently block new/updated evaluation datasets and topic configs from being committed. That can break reproducibility for other contributors.

Proposed change

```diff
-backend/app/evaluation/datasets/**/*.csv
-backend/app/evaluation/datasets/**/*.txt
+# Keep canonical evaluation datasets/configs tracked.
+# Ignore only generated artifacts under outputs.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 12 - 13, The .gitignore currently hides all CSV and TXT files under backend/app/evaluation/datasets/** which can prevent source-of-truth evaluation datasets and topic configs from being committed; update the ignore rules to only exclude truly generated or transient files (e.g., files under a generated/ or outputs/ subdirectory) and remove or scope the broad patterns backend/app/evaluation/datasets/**/*.csv and backend/app/evaluation/datasets/**/*.txt so that canonical dataset files are tracked; specifically, replace the global patterns with more targeted patterns (or add explicit negative exceptions) so functions/processes that rely on files in backend/app/evaluation/datasets (the dataset/topic config source) aren’t accidentally hidden.
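One possible shape for the narrowed rules; the `outputs/` and `canonical/` directory names are assumptions about this repo's layout, not paths confirmed by the diff:

```gitignore
# Track canonical datasets/configs; ignore only generated artifacts.
backend/app/evaluation/**/outputs/

# Or, if the broad patterns must stay, re-include specific tracked files:
# backend/app/evaluation/datasets/**/*.csv
# !backend/app/evaluation/datasets/canonical/*.csv
```

Note that `!` negation only re-includes a file if no parent directory of it is itself ignored.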
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 50-63: Before creating normalized_df, validate that df contains
required columns ("input", "category", "scope") and that every value in
df["scope"] (after .astype(str).str.strip().upper()) is one of the allowed
labels (e.g., "IN_SCOPE" or "OUT_OF_SCOPE"); on any missing column or invalid
scope value, raise a clear Exception referencing dataset_path so the run fails
fast. In the TopicRelevance flow, normalize scope strings via
.str.strip().str.upper() and map "IN_SCOPE" -> 1 and "OUT_OF_SCOPE" -> 0 only
after validation (do not default other values to 0), and update the creation of
normalized_df accordingly so y_true cannot be silently corrupted.
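The strict mapping this prompt asks for can be sketched without pandas; the helper name and error message are illustrative, but the normalization (`strip` + `upper`) and label map follow the prompt:

```python
SCOPE_TO_LABEL = {"IN_SCOPE": 1, "OUT_OF_SCOPE": 0}

def map_scope_labels(raw_scopes, dataset_path="<dataset>"):
    """Normalize scope strings and map them to y_true, failing on unknowns."""
    y_true = []
    for raw in raw_scopes:
        scope = str(raw).strip().upper()
        if scope not in SCOPE_TO_LABEL:
            raise ValueError(
                f"Invalid scope {raw!r} in {dataset_path}; "
                f"expected one of {sorted(SCOPE_TO_LABEL)}"
            )
        y_true.append(SCOPE_TO_LABEL[scope])
    return y_true

print(map_scope_labels([" in_scope ", "OUT_OF_SCOPE"]))  # [1, 0]
```

Raising on unknown values, rather than defaulting them to 0, is the point: a typo in the dataset fails the run instead of corrupting `y_true`.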
---
Duplicate comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Around line 40-65: After loading the CSV into dataset (DATASET_PATH), perform
a fail-fast schema validation that checks for required columns before running
run_ban_list/Profiler: ensure "source_text" always exists, and either "label"
exists or both "target_text" exists (so the y_true fallback can run); if any
required column is missing, raise a clear ValueError or log an error listing the
missing columns and exit early so downstream calls to dataset["source_text"],
dataset["target_text"], or dataset["label"] do not fail with obscure errors.
- Around line 42-48: The BanList validator is created without an explicit
on_fail mode so FailResult.fix_value may be unset; change the BanList
instantiation in run_ban_list to include an explicit redaction mode (e.g., set
on_fail="redact") so that validator.validate(...) returns a FailResult with
fix_value populated and redacted_text uses that value; update the BanList(...)
call (and any related tests) to explicitly pass on_fail="redact" to ensure
predictable redaction behavior.
---
Nitpick comments:
In @.gitignore:
- Around line 12-13: The .gitignore currently hides all CSV and TXT files under
backend/app/evaluation/datasets/** which can prevent source-of-truth evaluation
datasets and topic configs from being committed; update the ignore rules to only
exclude truly generated or transient files (e.g., files under a generated/ or
outputs/ subdirectory) and remove or scope the broad patterns
backend/app/evaluation/datasets/**/*.csv and
backend/app/evaluation/datasets/**/*.txt so that canonical dataset files are
tracked; specifically, replace the global patterns with more targeted patterns
(or add explicit negative exceptions) so functions/processes that rely on files
in backend/app/evaluation/datasets (the dataset/topic config source) aren’t
accidentally hidden.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 47a4c401-b8a4-4a49-9903-ebf13940fa63
📒 Files selected for processing (4)
- .gitignore
- backend/app/evaluation/ban_list/run.py
- backend/app/evaluation/topic_relevance/run.py
- backend/scripts/run_all_evaluations.sh
I feel like we should make a separate readme for evaluation scripts. Since making a dataset for each evaluation script depends on the validator and is turning out to be different each time, there should be a separate md in the evaluation folder that contains everything about evaluation.
🧹 Nitpick comments (2)
backend/app/evaluation/topic_relevance/run.py (2)
102-113: Minor inconsistency: use `dataset_path.name` for consistency with ban_list.

Line 107 uses `str(dataset_path)` (full path), while `ban_list/run.py` uses `str(DATASET_PATH.name)` (filename only). Consider aligning for a consistent report format.

Proposed fix

```diff
-        dataset=str(dataset_path),
+        dataset=str(dataset_path.name),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 102 - 113, The report currently passes the full dataset path via str(dataset_path) into build_evaluation_report; change this to use dataset_path.name so the report uses the filename (consistent with ban_list's use of DATASET_PATH.name). Update the call site in the block that calls write_json(build_evaluation_report(..., dataset=str(dataset_path), ...), OUT_DIR / f"{domain}-metrics.json") to pass dataset=dataset_path.name instead, leaving other parameters (guardrail, num_samples, profiler p, llm_callable, prompt_schema_version, metrics) and the write_json/OUT_DIR usage unchanged.
73-76: Consider simplifying the nested lambda.

The nested lambda `lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)` creates a new inner function object per row. This can be simplified by defining the validation function once.

Proposed refactor

```diff
+def validate_input(text: str):
+    return validator.validate(text, metadata=None)
+
 with Profiler() as p:
     results = normalized_df["input"].apply(
-        lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)
+        lambda x: p.record(validate_input, x)
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 73 - 76, The apply creates a new inner lambda per row; instead define the validation callable once and pass it into p.record to avoid recreating functions repeatedly. In run.py, create a single validation function (e.g., with functools.partial or a named lambda) that calls validator.validate(..., metadata=None) and then call normalized_df["input"].apply(lambda x: p.record(validation_fn, x)) (or equivalent) so Profiler.p.record receives the pre-defined callable instead of constructing a new inner lambda for every row.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 102-113: The report currently passes the full dataset path via
str(dataset_path) into build_evaluation_report; change this to use
dataset_path.name so the report uses the filename (consistent with ban_list's
use of DATASET_PATH.name). Update the call site in the block that calls
write_json(build_evaluation_report(..., dataset=str(dataset_path), ...), OUT_DIR
/ f"{domain}-metrics.json") to pass dataset=dataset_path.name instead, leaving
other parameters (guardrail, num_samples, profiler p, llm_callable,
prompt_schema_version, metrics) and the write_json/OUT_DIR usage unchanged.
- Around line 73-76: The apply creates a new inner lambda per row; instead
define the validation callable once and pass it into p.record to avoid
recreating functions repeatedly. In run.py, create a single validation function
(e.g., with functools.partial or a named lambda) that calls
validator.validate(..., metadata=None) and then call
normalized_df["input"].apply(lambda x: p.record(validation_fn, x)) (or
equivalent) so Profiler.p.record receives the pre-defined callable instead of
constructing a new inner lambda for every row.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ae72bfa2-b27a-40f9-a585-45bdcd7fba66
📒 Files selected for processing (3)
- backend/app/evaluation/ban_list/run.py
- backend/app/evaluation/common/helper.py
- backend/app/evaluation/topic_relevance/run.py
🧹 Nitpick comments (2)
backend/app/evaluation/topic_relevance/run.py (2)
51-57: Add file existence checks before reading.

If the dataset or topic config files are missing, the current code will raise less informative errors. Consider validating file existence upfront.
Proposed improvement
```diff
     dataset_path = DATASETS_DIR / config["dataset"]
     topic_config_path = DATASETS_DIR / config["topic_config"]
+
+    if not dataset_path.exists():
+        raise FileNotFoundError(f"Dataset not found: {dataset_path}")
+    if not topic_config_path.exists():
+        raise FileNotFoundError(f"Topic config not found: {topic_config_path}")
+
     topic_config = topic_config_path.read_text()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` around lines 51 - 57, Before reading files, validate that dataset_path and topic_config_path exist (constructed from DATASETS_DIR and config["dataset"]/config["topic_config"]) and raise or log a clear error if missing; replace direct calls to topic_config_path.read_text() and pd.read_csv(dataset_path) with existence checks (e.g., if not dataset_path.exists(): raise FileNotFoundError(f"Dataset not found: {dataset_path}")) so the topic_config and dataframe loading in run.py produce informative messages including the domain and the missing path.
55-55: Consider using `logging` instead of `print`.

Static analysis flagged the `print` calls; `logging` would provide better control over verbosity and align with production code standards.

Also applies to: 117-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/topic_relevance/run.py` at line 55, Replace the module-level print calls (e.g., the print(f"\nRunning topic relevance evaluation: {domain}") and the other print at the later location) with standard logging calls: obtain a logger via logging.getLogger(__name__) at the top of backend/app/evaluation/topic_relevance/run.py, then use logger.info or logger.debug for progress messages and logger.error for errors; ensure the logger is configured (or rely on application logging configuration) instead of using print so verbosity can be controlled consistently.
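A minimal sketch of the print-to-logging swap the prompt describes; the function name and message are illustrative, only the module-level logger pattern is the point:

```python
import logging

logger = logging.getLogger(__name__)

def run_domain_evaluation(domain: str) -> None:
    # Replaces: print(f"\nRunning topic relevance evaluation: {domain}")
    logger.info("Running topic relevance evaluation: %s", domain)

if __name__ == "__main__":
    # Script entry points can configure verbosity in one place.
    logging.basicConfig(level=logging.INFO)
    run_domain_evaluation("example-domain")
```

Library code should only call `getLogger` and emit records; `basicConfig` (or any handler setup) belongs at the entry point so callers control verbosity.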
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 51-57: Before reading files, validate that dataset_path and
topic_config_path exist (constructed from DATASETS_DIR and
config["dataset"]/config["topic_config"]) and raise or log a clear error if
missing; replace direct calls to topic_config_path.read_text() and
pd.read_csv(dataset_path) with existence checks (e.g., if not
dataset_path.exists(): raise FileNotFoundError(f"Dataset not found:
{dataset_path}")) so the topic_config and dataframe loading in run.py produce
informative messages including the domain and the missing path.
- Line 55: Replace the module-level print calls (e.g., the print(f"\nRunning
topic relevance evaluation: {domain}") and the other print at the later
location) with standard logging calls: obtain a logger via
logging.getLogger(__name__) at the top of
backend/app/evaluation/topic_relevance/run.py, then use logger.info or
logger.debug for progress messages and logger.error for errors; ensure the
logger is configured (or rely on application logging configuration) instead of
using print so verbosity can be controlled consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a8be8840-34f7-4b83-b854-ad1fe97f02a8
📒 Files selected for processing (1)
backend/app/evaluation/topic_relevance/run.py
Summary
Target issue is #75.
Explain the motivation for making this change. What existing problem does the pull request solve?
This PR contains the following changes -
`backend/app/evaluation/topic_relevance/run.py` has some pre-configured topic configurations and datasets that we will be running to evaluate the validator performance. All testing datasets are present here: https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89
Checklist
Before submitting a pull request, please ensure that you mark these tasks.

Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
New Features
Improvements
Chores