Build review, evaluation, and report workflows #10

Open
Spbd1 wants to merge 1 commit into codex/build-react-dashboard-with-specified-features from codex/build-review,-evaluation,-and-report-workflows-wm0oc0

Spbd1 (Owner) commented May 18, 2026

Motivation

  • Add user-facing review, benchmark evaluation, and report generation features so analysis outputs can be reviewed, evaluated, and exported from the dashboard and CLI.
  • Provide an append-only, locally persisted review store and validation to support human-in-the-loop corrections for MVP workflows.
  • Provide simple operational evaluation metrics and an evaluation runner to measure model behavior against a small benchmark without implying scientific validation.
  • Provide multi-format report generation (JSON/Markdown/HTML) with storage and download endpoints for sharing analysis results.

Description

  • Implemented review models and validation in engine/argument_risk_engine/review/models.py, and append-only JSONL storage with helpers in engine/argument_risk_engine/review/store.py. The store persists to data/review/review_store.jsonl and includes a legacy-feedback adapter.
  • Added a review service (backend/app/services/review_service.py) and routes (backend/app/api/routes_review.py) that expose GET /review/items, POST /review/items, and GET /review/summary, and preserved the legacy POST /review/feedback adapter.
  • Implemented evaluation metrics and a runner in engine/argument_risk_engine/evaluation/metrics.py and engine/argument_risk_engine/evaluation/runner.py. They compute label precision, recall, and F1; false-positive rate; evidence-span exact and partial match; human review rate; over-classification rate; and no-finding rate. They also collect false-positive, false-negative, and evidence-span-miss error lists and include a non-scientific disclaimer.
  • Added an evaluation service and the routes POST /evaluation/run, GET /evaluation/summary, and GET /evaluation/errors, with results persisted under data/evaluation and a small JSONL mini-benchmark at data/benchmarks/mini_eval_set.jsonl.
  • Implemented report renderers in engine/argument_risk_engine/reports/{json_export,markdown,html}.py, plus a report service with local persistence and indexing and the API endpoints POST /reports/from-analysis, GET /reports, GET /reports/{report_id}, and GET /reports/{report_id}/download in backend/app/services/report_service.py and backend/app/api/routes_reports.py.
  • Wired routers into the application for both root and /api prefixes and added a CLI wrapper scripts/run_evaluation.py to run the mini benchmark from the command line.
  • Added and updated tests (tests/test_evaluation.py, tests/test_review_reports_api.py) to cover metrics, runner output, review persistence, report generation, and downloads.
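
For reviewers unfamiliar with the append-only JSONL pattern used by the review store, here is a minimal sketch of the idea. It is not the code in store.py: the function names `append_review` and `load_reviews` and the record fields (`item_id`, `verdict`) are hypothetical, chosen only for illustration. Each record is written as one JSON line, and earlier lines are never rewritten.

```python
import json
import tempfile
from pathlib import Path


def append_review(path: Path, record: dict) -> None:
    """Append one record as a single JSON line; earlier lines are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_reviews(path: Path) -> list[dict]:
    """Read all records back in insertion order; a missing file means no reviews yet."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo against a temporary directory so the real data/review store is untouched.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "review" / "review_store.jsonl"
    append_review(store, {"item_id": "a", "verdict": "accept"})  # hypothetical fields
    append_review(store, {"item_id": "b", "verdict": "reject"})
    rows = load_reviews(store)
```

Because corrections are appended rather than edited in place, the file doubles as an audit trail; a "latest wins" view can be derived at read time if needed.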
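
As a quick illustration of the label precision, recall, and F1 metrics listed above, the following sketch computes them for one item using the standard set-based definitions. It is an assumption about the general approach, not a copy of metrics.py: the function name `label_metrics` and the label strings are hypothetical, and the real runner may aggregate across items differently (for example micro or macro averaging).

```python
def label_metrics(predicted: set[str], expected: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 for the labels of a single item."""
    tp = len(predicted & expected)   # labels predicted and expected
    fp = len(predicted - expected)   # labels predicted but not expected
    fn = len(expected - predicted)   # labels expected but not predicted

    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: one hit, one false positive, one false negative.
result = label_metrics(
    predicted={"strawman", "ad_hominem"},
    expected={"strawman", "slippery_slope"},
)
# result == {"precision": 0.5, "recall": 0.5, "f1": 0.5}
```

Returning 0.0 for empty denominators is one possible convention; an implementation could equally report such cases as undefined and exclude them from averages.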

Testing

  • Ran the mini-benchmark CLI via python scripts/run_evaluation.py, which executed the runner and printed the metrics and the disclaimer (smoke run succeeded).
  • Ran the full test suite via python -m pytest -q, which passed (42 passed, 4 warnings).
  • Ran the focused tests tests/test_evaluation.py and tests/test_review_reports_api.py, which passed (4 passed, 1 warning).
  • Ran static and style checks with python -m ruff check ... as part of CI-style linting and fixed the reported issues, so the lint check passes locally.

Codex Task
