
Added evaluation script for topic relevance #74

Merged
rkritika1508 merged 23 commits into main from feat/topic-relevance-evaluation
Mar 20, 2026

Conversation


@rkritika1508 rkritika1508 commented Mar 6, 2026

Summary

Target issue is #75.
This PR contains the following changes -

  1. Added code to evaluate the topic relevance validator. backend/app/evaluation/topic_relevance/run.py has pre-configured topic configurations and datasets that we run to evaluate validator performance.
  2. Made the ban-list evaluation consistent with the topic relevance validator. We now have a basic framework for writing evaluation scripts for validators.
  3. Updated the shell script to add the topic relevance evaluator.

All testing datasets are present here - https://drive.google.com/drive/u/0/folders/1Rd1LH-oEwCkU0pBDRrYYedExorwmXA89
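As a rough illustration of the shared pattern described in point 2 — in-code evaluation configs plus a run_evaluation(config) entry point — here is a minimal, hypothetical sketch. The names, the inline dataset, and the toy predictor are illustrative only, not the repo's actual API:

```python
# Hypothetical sketch of the per-validator run.py pattern: a list of
# pre-configured evaluations and a run_evaluation(config) entry point.
EVALUATIONS = [
    {
        "name": "education",
        # (input text, expected label) pairs; real scripts read CSVs instead.
        "dataset": [("what is photosynthesis", 1), ("best pizza toppings", 0)],
    },
]

def run_evaluation(config: dict) -> dict:
    """Run one pre-configured evaluation and return simple metrics."""
    y_true = [label for _, label in config["dataset"]]
    # Stand-in for the real validator call (e.g. TopicRelevance.validate).
    y_pred = [1 if "photosynthesis" in text else 0 for text, _ in config["dataset"]]
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {"name": config["name"], "accuracy": round(correct / len(y_true), 2)}

def main() -> None:
    for config in EVALUATIONS:
        print(run_evaluation(config))

if __name__ == "__main__":
    main()
```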

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add any other information required for the reviewer here.

Summary by CodeRabbit

  • New Features

    • Added topic-relevance evaluations for education and healthcare with per-category metrics and profiling.
    • Ban-list evaluations now run per configuration and produce separate prediction and metrics outputs.
  • Improvements

    • Metrics now include overall accuracy alongside existing binary scores.
    • Evaluation runner output formatting improved for clarity.
  • Chores

    • Broadened ignore patterns for evaluation datasets and outputs in .gitignore.
    • Runner no longer accepts inline KEY=VALUE environment assignments.


coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Added a new topic-relevance evaluator, refactored the ban-list evaluator to use in-code evaluation configs and reusable runner functions, expanded .gitignore to broader evaluation output patterns, updated the top-level evaluation runner script, and added accuracy to binary metrics.

Changes

Cohort / File(s) Summary
Configuration and Ignore Patterns
\.gitignore
Replaced specific ignore entries (metrics.json, predictions.csv, dataset CSVs) with recursive patterns to ignore CSV/TXT under backend/app/evaluation/datasets/**/ and CSV/JSON under backend/app/evaluation/outputs/**/.
Ban List Evaluator Refactoring
backend/app/evaluation/ban_list/run.py
Removed env-var-driven flow; added BAN_LIST_EVALUATIONS list, run_evaluation(config) and main(); changed outputs to per-config "{name}-predictions.csv" / "{name}-metrics.json"; added start/completion logs.
Topic Relevance Evaluator
backend/app/evaluation/topic_relevance/run.py
New module implementing domain evaluations (education, healthcare): loads datasets/topic configs, normalizes inputs, runs TopicRelevance with Profiler, computes overall and per-category binary metrics, writes per-domain predictions and metrics (with profiler).
Evaluation Runner Script
backend/scripts/run_all_evaluations.sh
Removed KEY=VALUE argument export handling; added topic_relevance/run.py to runner list; adjusted console banner spacing.
Common Metrics Helper
backend/app/evaluation/common/helper.py
compute_binary_metrics(y_true, y_pred) now returns accuracy (rounded to 2 decimals) in addition to existing confusion/score metrics.
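A plain-Python sketch of what such a helper might look like; the repo's actual implementation may differ, and only the accuracy-rounded-to-2-decimals behavior comes from the summary above:

```python
def compute_binary_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Confusion counts plus precision/recall/F1 and (new) rounded accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # New: overall accuracy alongside the existing scores, rounded to 2 decimals.
    accuracy = round((tp + tn) / len(y_true), 2) if y_true else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": accuracy,
    }
```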

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Evaluation Runner (sh)
    participant Script as Eval Script (Python)
    participant Validator as Validator (BanList / TopicRelevance)
    participant Profiler as Profiler
    participant FS as Filesystem (datasets/outputs)
    Runner->>Script: invoke runner script -> python eval script
    Script->>FS: read dataset CSV & (optional) topic config
    Script->>Validator: construct validator, call validate(input)
    Validator-->>Script: validation result (pass/fail + meta)
    Script->>Profiler: record timing/metadata
    Script->>FS: write predictions CSV ({name}-predictions.csv)
    Script->>FS: write metrics JSON ({name}-metrics.json with profiler & per-category metrics)
    Script-->>Runner: completion log
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • nishika26
  • AkhileshNegi

Poem

🐰 I hopped through datasets, files in tow,
Banned words tumbled, topics in a row,
Scripts now run with tidy names,
Metrics waltz and profiling frames,
Rabbity cheers for outputs that grow!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 53.85%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately describes the primary change: adding a new evaluation script for topic relevance.


@rkritika1508 rkritika1508 self-assigned this Mar 6, 2026
@rkritika1508 rkritika1508 linked an issue Mar 6, 2026 that may be closed by this pull request
Base automatically changed from feat/llm-critic to main March 19, 2026 05:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/tests/test_validate_with_guard.py (1)

6-9: ⚠️ Potential issue | 🔴 Critical

Missing import causes CI failure: _resolve_topic_relevance_scope is not imported.

The tests at lines 272-339 call _resolve_topic_relevance_scope, but this function is not included in the import statement. This causes the NameError failures in CI.

🐛 Proposed fix to add the missing import
 from app.api.routes.guardrails import (
     _resolve_validator_configs,
     _validate_with_guard,
+    _resolve_topic_relevance_scope,
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/test_validate_with_guard.py` around lines 6 - 9, The test
failure is caused by a missing import for _resolve_topic_relevance_scope used in
tests; update the import list from app.api.routes.guardrails to include
_resolve_topic_relevance_scope alongside _resolve_validator_configs and
_validate_with_guard so the symbol is available to the tests, then run the test
suite to confirm the NameError is resolved.
🧹 Nitpick comments (2)
backend/scripts/run_all_evaluations.sh (1)

19-26: Consider a preflight existence check for each runner path.

This gives a clearer error than a Python invocation failure when a path is wrong/moved.

Proposed refactor
 for runner in "${RUNNERS[@]}"; do
   name="$(basename "$(dirname "$runner")")"
 
+  if [[ ! -f "$runner" ]]; then
+    echo "Missing runner: $runner" >&2
+    exit 1
+  fi
+
   echo ""
   echo "==> [$name] $runner"
 
   uv run python "$runner"
 done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/scripts/run_all_evaluations.sh` around lines 19 - 26, Add a preflight
existence/readability check for each runner inside the for loop: before running
uv run python on "$runner" (the RUNNERS array and the loop using variable runner
and name), verify the path exists and is a regular file/readable and, if not,
print a clear error referencing $runner and exit non‑zero (or skip based on
desired behavior) so the script fails fast with a helpful message instead of
relying on the Python invocation to report a missing file.
backend/app/evaluation/topic_relevance/run.py (1)

68-71: Consider simplifying the nested lambda for readability.

The nested lambda structure is slightly convoluted:

lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)

A local function would be clearer, though this is a minor readability improvement.

♻️ Suggested refactor for clarity
+    def validate_input(text: str):
+        return validator.validate(text, metadata=None)
+
     with Profiler() as p:
-        results = normalized_df["input"].apply(
-            lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)
-        )
+        results = normalized_df["input"].apply(
+            lambda x: p.record(validate_input, x)
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/topic_relevance/run.py` around lines 68 - 71, The
nested lambda used in normalized_df["input"].apply is hard to read: currently it
calls p.record with another lambda wrapping validator.validate. Replace the
nested lambda with a small local helper function (e.g., validate_input or
record_validation) that takes the input, calls p.record with validator.validate
(passing metadata=None), and returns the result; then pass that helper to
normalized_df["input"].apply so Profiler(), p.record, and validator.validate are
used but the call site is clearer.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Around line 40-65: After loading dataset (dataset = pd.read_csv(DATASET_PATH))
validate required columns up front: ensure either "label" exists or both
"source_text" and "target_text" exist; if neither condition holds raise a clear
ValueError describing the missing columns. Place this check before any metric
derivation (before computing results/redacted_text/y_pred or run_ban_list
profiling) so the script fails fast; reference dataset, DATASET_PATH, and the
run_ban_list/Profiler block when adding the validation.
- Around line 42-48: The BanList is being constructed without an explicit
on_fail mode so it defaults to noop and FailResult.fix_value remains unset;
update the BanList initialization in run_ban_list to pass the intended on_fail
behavior (e.g., the same mode used by BanListSafetyValidatorConfig such as
"redact" or the appropriate enum/value) so failures produce a populated
FailResult.fix_value, and ensure the downstream variable (redacted_text) will
contain the redacted value from result.fix_value when result is a FailResult in
the run_ban_list function.

In `@backend/app/tests/test_validate_with_guard.py`:
- Around line 319-339: The test name and assertions are misleading because
_resolve_topic_relevance_scope currently only checks topic_relevance_config_id,
not whether a validator has an inline configuration; update the test to reflect
actual behavior by renaming it and asserting that when topic_relevance_config_id
is absent on the GuardrailRequest the CRUD get is not called (and conversely add
a test that sets topic_relevance_config_id to ensure mock_get is called),
referencing _resolve_topic_relevance_scope, GuardrailRequest,
validator.configuration, and topic_relevance_crud.get; alternatively, if you
intended runtime inline configuration to suppress lookups, modify
_resolve_topic_relevance_scope to check validator.get("configuration") and
short-circuit when present, then update or remove the duplicate tests
accordingly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7450187-141d-47cd-9aa6-ac2c7f81a2a2

📥 Commits

Reviewing files that changed from the base of the PR and between a4c9b0d and bfa818b.

📒 Files selected for processing (11)
  • .gitignore
  • backend/app/api/docs/topic_relevance_configs/create_topic_relevance_config.md
  • backend/app/api/docs/topic_relevance_configs/delete_topic_relevance_config.md
  • backend/app/api/docs/topic_relevance_configs/get_topic_relevance_config.md
  • backend/app/api/docs/topic_relevance_configs/list_topic_relevance_configs.md
  • backend/app/api/docs/topic_relevance_configs/update_topic_relevance_config.md
  • backend/app/api/routes/guardrails.py
  • backend/app/evaluation/ban_list/run.py
  • backend/app/evaluation/topic_relevance/run.py
  • backend/app/tests/test_validate_with_guard.py
  • backend/scripts/run_all_evaluations.sh

@rkritika1508 rkritika1508 added enhancement New feature or request ready-for-review labels Mar 19, 2026
@rkritika1508
Collaborator Author

@coderabbitai full review


coderabbitai bot commented Mar 19, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/app/evaluation/ban_list/run.py (2)

40-65: ⚠️ Potential issue | 🟡 Minor

Add fail-fast dataset schema validation before deriving metrics.

If required columns are absent, the script fails late (e.g., at Line 52 or Line 64) with less actionable errors. Validate required columns immediately after load.

Proposed fix
     dataset = pd.read_csv(DATASET_PATH)
+    required = {"source_text"}
+    missing_required = required - set(dataset.columns)
+    if missing_required:
+        raise ValueError(f"Dataset missing required columns: {sorted(missing_required)}")
+    if "label" not in dataset.columns and "target_text" not in dataset.columns:
+        raise ValueError("Dataset must include either 'label' or 'target_text'.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/ban_list/run.py` around lines 40 - 65, After loading
the CSV into dataset (DATASET_PATH), perform a fail-fast schema validation that
checks for required columns before running run_ban_list/Profiler: ensure
"source_text" always exists, and either "label" exists or both "target_text"
exists (so the y_true fallback can run); if any required column is missing,
raise a clear ValueError or log an error listing the missing columns and exit
early so downstream calls to dataset["source_text"], dataset["target_text"], or
dataset["label"] do not fail with obscure errors.

42-48: ⚠️ Potential issue | 🟠 Major

Make BanList on_fail explicit to match redaction intent.

At Line 42, the default on_fail mode may leave FailResult.fix_value unset, so redacted_text at Line 57 can end up being original text. Set an explicit mode aligned with intended behavior.

Proposed fix
-    validator = BanList(banned_words=banned_words)
+    validator = BanList(banned_words=banned_words, on_fail="fix")
For the current guardrails.hub BanList validator, what are the default `on_fail` behavior and the valid `on_fail` modes that populate `FailResult.fix_value`?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/ban_list/run.py` around lines 42 - 48, The BanList
validator is created without an explicit on_fail mode so FailResult.fix_value
may be unset; change the BanList instantiation in run_ban_list to include an
explicit redaction mode (e.g., set on_fail="redact") so that
validator.validate(...) returns a FailResult with fix_value populated and
redacted_text uses that value; update the BanList(...) call (and any related
tests) to explicitly pass on_fail="redact" to ensure predictable redaction
behavior.
🧹 Nitpick comments (1)
.gitignore (1)

12-13: Narrow dataset ignore patterns to avoid hiding source-of-truth assets.

Ignoring backend/app/evaluation/datasets/**/*.csv and **/*.txt can silently block new/updated evaluation datasets and topic configs from being committed. That can break reproducibility for other contributors.

Proposed change
-backend/app/evaluation/datasets/**/*.csv
-backend/app/evaluation/datasets/**/*.txt
+# Keep canonical evaluation datasets/configs tracked.
+# Ignore only generated artifacts under outputs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 12 - 13, The .gitignore currently hides all CSV and
TXT files under backend/app/evaluation/datasets/** which can prevent
source-of-truth evaluation datasets and topic configs from being committed;
update the ignore rules to only exclude truly generated or transient files
(e.g., files under a generated/ or outputs/ subdirectory) and remove or scope
the broad patterns backend/app/evaluation/datasets/**/*.csv and
backend/app/evaluation/datasets/**/*.txt so that canonical dataset files are
tracked; specifically, replace the global patterns with more targeted patterns
(or add explicit negative exceptions) so functions/processes that rely on files
in backend/app/evaluation/datasets (the dataset/topic config source) aren’t
accidentally hidden.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/evaluation/topic_relevance/run.py`:
- Around line 50-63: Before creating normalized_df, validate that df contains
required columns ("input", "category", "scope") and that every value in
df["scope"] (after .astype(str).str.strip().upper()) is one of the allowed
labels (e.g., "IN_SCOPE" or "OUT_OF_SCOPE"); on any missing column or invalid
scope value, raise a clear Exception referencing dataset_path so the run fails
fast. In the TopicRelevance flow, normalize scope strings via
.str.strip().str.upper() and map "IN_SCOPE" -> 1 and "OUT_OF_SCOPE" -> 0 only
after validation (do not default other values to 0), and update the creation of
normalized_df accordingly so y_true cannot be silently corrupted.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47a4c401-b8a4-4a49-9903-ebf13940fa63

📥 Commits

Reviewing files that changed from the base of the PR and between a4c9b0d and 5f23269.

📒 Files selected for processing (4)
  • .gitignore
  • backend/app/evaluation/ban_list/run.py
  • backend/app/evaluation/topic_relevance/run.py
  • backend/scripts/run_all_evaluations.sh

@nishika26
Contributor

I feel like we should make a separate README for evaluation scripts. Since building a dataset for each evaluation script is turning out to be different depending on the validator, there should be a separate md in the evaluation folder that contains everything about evaluation.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
backend/app/evaluation/topic_relevance/run.py (2)

102-113: Minor inconsistency: use dataset_path.name for consistency with ban_list.

Line 107 uses str(dataset_path) (full path), while ban_list/run.py uses str(DATASET_PATH.name) (filename only). Consider aligning for consistent report format.

Proposed fix
-            dataset=str(dataset_path),
+            dataset=str(dataset_path.name),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/topic_relevance/run.py` around lines 102 - 113, The
report currently passes the full dataset path via str(dataset_path) into
build_evaluation_report; change this to use dataset_path.name so the report uses
the filename (consistent with ban_list's use of DATASET_PATH.name). Update the
call site in the block that calls write_json(build_evaluation_report(...,
dataset=str(dataset_path), ...), OUT_DIR / f"{domain}-metrics.json") to pass
dataset=dataset_path.name instead, leaving other parameters (guardrail,
num_samples, profiler p, llm_callable, prompt_schema_version, metrics) and the
write_json/OUT_DIR usage unchanged.

73-76: Consider simplifying the nested lambda.

The nested lambda lambda x: p.record(lambda t: validator.validate(t, metadata=None), x) creates a new inner function object per row. This can be simplified by defining the validation function once.

Proposed refactor
+    def validate_input(text: str):
+        return validator.validate(text, metadata=None)
+
     with Profiler() as p:
         results = normalized_df["input"].apply(
-            lambda x: p.record(lambda t: validator.validate(t, metadata=None), x)
+            lambda x: p.record(validate_input, x)
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/topic_relevance/run.py` around lines 73 - 76, The
apply creates a new inner lambda per row; instead define the validation callable
once and pass it into p.record to avoid recreating functions repeatedly. In
run.py, create a single validation function (e.g., with functools.partial or a
named lambda) that calls validator.validate(..., metadata=None) and then call
normalized_df["input"].apply(lambda x: p.record(validation_fn, x)) (or
equivalent) so Profiler.p.record receives the pre-defined callable instead of
constructing a new inner lambda for every row.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ae72bfa2-b27a-40f9-a585-45bdcd7fba66

📥 Commits

Reviewing files that changed from the base of the PR and between 5f23269 and 8a7bdf9.

📒 Files selected for processing (3)
  • backend/app/evaluation/ban_list/run.py
  • backend/app/evaluation/common/helper.py
  • backend/app/evaluation/topic_relevance/run.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
backend/app/evaluation/topic_relevance/run.py (2)

51-57: Add file existence checks before reading.

If the dataset or topic config files are missing, the current code will raise less informative errors. Consider validating file existence upfront.

Proposed improvement
     dataset_path = DATASETS_DIR / config["dataset"]
     topic_config_path = DATASETS_DIR / config["topic_config"]
+
+    if not dataset_path.exists():
+        raise FileNotFoundError(f"Dataset not found: {dataset_path}")
+    if not topic_config_path.exists():
+        raise FileNotFoundError(f"Topic config not found: {topic_config_path}")
+
     topic_config = topic_config_path.read_text()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/topic_relevance/run.py` around lines 51 - 57, Before
reading files, validate that dataset_path and topic_config_path exist
(constructed from DATASETS_DIR and config["dataset"]/config["topic_config"]) and
raise or log a clear error if missing; replace direct calls to
topic_config_path.read_text() and pd.read_csv(dataset_path) with existence
checks (e.g., if not dataset_path.exists(): raise FileNotFoundError(f"Dataset
not found: {dataset_path}")) so the topic_config and dataframe loading in run.py
produce informative messages including the domain and the missing path.

55-55: Consider using logging instead of print for progress output.

Static analysis flagged the print calls. For evaluation scripts, progress output is reasonable, but using logging would provide better control over verbosity and align with production code standards.

Also applies to: 117-117

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/evaluation/topic_relevance/run.py` at line 55, Replace the
module-level print calls (e.g., the print(f"\nRunning topic relevance
evaluation: {domain}") and the other print at the later location) with standard
logging calls: obtain a logger via logging.getLogger(__name__) at the top of
backend/app/evaluation/topic_relevance/run.py, then use logger.info or
logger.debug for progress messages and logger.error for errors; ensure the
logger is configured (or rely on application logging configuration) instead of
using print so verbosity can be controlled consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a8be8840-34f7-4b83-b854-ad1fe97f02a8

📥 Commits

Reviewing files that changed from the base of the PR and between 8a7bdf9 and ea61cff.

📒 Files selected for processing (1)
  • backend/app/evaluation/topic_relevance/run.py

@rkritika1508 rkritika1508 merged commit 95a1935 into main Mar 20, 2026
2 checks passed
@rkritika1508 rkritika1508 deleted the feat/topic-relevance-evaluation branch March 20, 2026 10:38

Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Topic relevance validator evaluation

2 participants