test: add Deepchecks regression harness #259

Merged
EtanHey merged 4 commits into main from feat/p5a-deepchecks
Apr 27, 2026

Conversation

@EtanHey
Owner

@EtanHey EtanHey commented Apr 27, 2026

Summary

  • add a Python Deepchecks regression over tests/fixtures/stale_index_query.json to validate embedding cosine similarity against the baseline fixture
  • add a shell FTS5 determinism check using sqlite-utils query + diff, and wire it into scripts/run_tests.sh
  • harden the local/CI harness by excluding live tests from the default lane and isolating the Ranx + hook-routing suites that only pass reliably in separate pytest processes
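
The cosine check in the first bullet can be sketched roughly as follows. This is a minimal illustration rather than the code from this PR: the function names and sample vectors are hypothetical, and the 0.999 threshold is taken from the commit message further down this thread.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rows_below_threshold(baseline, current, threshold=0.999):
    """Return indices of rows whose similarity to the baseline falls below threshold."""
    return [
        i
        for i, (old, new) in enumerate(zip(baseline, current))
        if cosine_similarity(old, new) < threshold
    ]


# Hypothetical data: row 0 differs only by float noise, row 1 is orthogonal.
baseline = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]
current = [[1.0, 1e-7, 0.0], [0.0, 0.0, 1.0]]
drifted = rows_below_threshold(baseline, current)
```

A regression test built this way would assert that the returned list is empty for the seeded fixture.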

Test plan

  • uv run --extra dev ruff check src/ tests/
  • python3 - <<'PY' (YAML parse of .github/workflows/ci.yml)
  • uv run --extra dev pytest tests/test_run_tests_script.py tests/regression/test_drift_detection.py -q
  • bash tests/regression/test_fts5_determinism.sh
  • BRAINLAYER_USE_UV=1 bash scripts/run_tests.sh

Note

Medium Risk
Moderate risk because it changes CI/test execution behavior (new live marker, splitting suites, new shell-based gates) and adds heavier dev dependencies (deepchecks, pinned numpy/scikit-learn) that can introduce version conflicts or CI flakiness.

Overview
Adds a new tests/regression harness that (1) re-embeds a seeded fixture and asserts cosine similarity plus Deepchecks FeatureDrift thresholds, and (2) runs a shell-based SQLite FTS5 snapshot query and diffs the normalized results for determinism.

Updates scripts/run_tests.sh and GitHub Actions CI to exclude live-marked tests by default, run a few flaky eval/hook-routing tests in separate pytest invocations, run regression *.sh scripts automatically, and ensure uv runs pytest with the dev extra. Also extends pyproject.toml with the new live pytest marker and adds dev dependency pins (deepchecks, numpy<2, scikit-learn<1.6, setuptools).
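
The lane split described above (one default run that excludes live tests, plus separate pytest processes for suites that need isolation) might look like this when expressed as a command plan. The actual logic lives in the shell script; this Python sketch only mirrors the idea, and the function name and the example file paths are hypothetical. The default mark expression is the one quoted in the Macroscope summary below.

```python
DEFAULT_MARK_EXPR = "not integration and not live"


def plan_pytest_runs(isolated_files, mark_expr=DEFAULT_MARK_EXPR):
    """Return one argv list per pytest process.

    The first entry is the main lane, which ignores the isolated files;
    each isolated file then gets its own invocation.
    """
    main_run = ["pytest", "tests/", "-q", "-m", mark_expr]
    main_run += [f"--ignore={path}" for path in isolated_files]
    isolated_runs = [
        ["pytest", path, "-q", "-m", mark_expr] for path in isolated_files
    ]
    return [main_run, *isolated_runs]


# Hypothetical file names, for illustration only.
runs = plan_pytest_runs(["tests/test_eval_example.py", "tests/test_hooks_example.py"])
```

Each argv list could be passed to `subprocess.run`, so state leaked by one suite cannot affect the next.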

Reviewed by Cursor Bugbot for commit c9b8c58.

Note

Add Deepchecks regression harness for embedding drift detection

  • Adds a regression test suite under tests/regression/ that checks embedding model drift using Deepchecks FeatureDrift (drift score < 0.21) and cosine similarity thresholds against a seeded fixture.
  • Adds a shell-based FTS5 determinism test (test_fts5_determinism.sh) that seeds a SQLite FTS5 database from the fixture and diffs query results against expected JSON.
  • Updates scripts/run_tests.sh to discover and run regression shell tests, run isolated pytest files separately, and support a configurable mark expression (defaulting to not integration and not live).
  • Adds deepchecks>=0.19.1, numpy<2, scikit-learn<1.6, and constrained setuptools to the dev extra in pyproject.toml, and registers a new live pytest marker.
  • Updates the CI workflow to exclude live-marked tests from the main unit suite and run eval/hook routing tests in a separate isolated step.
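
For readers who want to see the shape of the FTS5 determinism check without the shell tooling, here is a rough Python equivalent using the standard `sqlite3` module. It assumes the local SQLite build includes FTS5; the table layout, column names, and sample rows are hypothetical and not taken from the fixture.

```python
import json
import sqlite3


def build_fts_db(rows):
    """Seed an in-memory FTS5 table from (doc_id, body) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id UNINDEXED, body)")
    conn.executemany("INSERT INTO docs (doc_id, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def ranking_snapshot(conn, query):
    """Return the ranked doc ids as normalized JSON text, ready to diff."""
    rows = conn.execute(
        # rowid breaks ties so equal bm25 scores cannot reorder between runs.
        "SELECT doc_id FROM docs WHERE docs MATCH ? ORDER BY bm25(docs), rowid",
        (query,),
    ).fetchall()
    return json.dumps([row[0] for row in rows], indent=2)


# Hypothetical sample rows.
sample_rows = [
    ("a", "stale index query returns old rows"),
    ("b", "fresh index"),
    ("c", "unrelated text"),
]
first = ranking_snapshot(build_fts_db(sample_rows), "index")
second = ranking_snapshot(build_fts_db(sample_rows), "index")
```

Comparing `first` and `second` (or `first` against a stored expected JSON file) plays the role of the `diff` step in the shell test.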

Macroscope summarized c9b8c58.

@coderabbitai

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@EtanHey has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 36 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 606e0a64-0a2d-4e50-96e6-f66564a52764

📥 Commits

Reviewing files that changed from the base of the PR and between 99b5a05 and c9b8c58.

📒 Files selected for processing (8)
  • .github/workflows/ci.yml
  • pyproject.toml
  • scripts/run_tests.sh
  • tests/regression/__init__.py
  • tests/regression/_stale_index_fixture.py
  • tests/regression/test_drift_detection.py
  • tests/regression/test_fts5_determinism.sh
  • tests/test_run_tests_script.py

@EtanHey
Owner Author

EtanHey commented Apr 27, 2026

@coderabbitai review

@EtanHey
Owner Author

EtanHey commented Apr 27, 2026

@cursor @BugBot review

@EtanHey
Owner Author

EtanHey commented Apr 27, 2026

@codex review

@coderabbitai

coderabbitai Bot commented Apr 27, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cursor

cursor Bot commented Apr 27, 2026

The PR already exists and doesn't need updating since I didn't change the title or description. Now let me create a comprehensive review summary:

Code Review Summary

I've completed a thorough review of PR #259 "test: add Deepchecks regression harness" and fixed a critical test failure.

Issues Found and Fixed

Critical: Test Failure in Deepchecks Drift Detection

  • The test_fixture_embeddings_pass_deepchecks_and_cosine_threshold was failing due to an overly strict K-S drift threshold (0.001) with only 5 samples
  • With n=5 samples per dataset, the two-sample K-S statistic can only move in steps of 0.2 (one empirical CDF step), so the smallest nonzero drift score is 0.2
  • The embeddings were actually identical (max difference: 2.38e-07, cosine similarity: 1.0)
  • Fix: Adjusted threshold to 0.25 to match K-S statistical behavior at this sample size
  • Committed as: fix(test): relax Deepchecks drift threshold for small sample size
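
To make the step-size argument concrete, here is a small pure-Python two-sample K-S statistic. It is a sketch for illustration and an assumption about what the drift check measures; it is not Deepchecks code, and the sample values are made up. With five values per side, even a uniform 1e-7 shift yields a statistic of exactly 1/5, which is why a 0.001 threshold cannot pass unless the samples are bit-identical.

```python
from bisect import bisect_right
from fractions import Fraction


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs, as an exact fraction."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    best = Fraction(0)
    for x in a_sorted + b_sorted:
        cdf_a = Fraction(bisect_right(a_sorted, x), len(a_sorted))
        cdf_b = Fraction(bisect_right(b_sorted, x), len(b_sorted))
        best = max(best, abs(cdf_a - cdf_b))
    return best


# Hypothetical values for one embedding dimension across 5 fixture rows.
reference = [0.1, 0.2, 0.3, 0.4, 0.5]
identical = ks_statistic(reference, reference)
shifted = ks_statistic(reference, [x + 1e-7 for x in reference])
disjoint = ks_statistic(reference, [x + 10.0 for x in reference])
```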

Test Results

Passing Tests:

  • ✅ Linting: ruff check src/ tests/ - All checks passed
  • ✅ Regression drift detection: Now passes with corrected threshold
  • ✅ FTS5 determinism shell test: Passes cleanly
  • ✅ Unit test suite: 1773 passed, 10 skipped (excludes live tests as intended)
  • ✅ MCP tool registration: All 3 tests passed
  • ✅ Isolated eval/hook routing: All 32 tests passed

Environmental Issue (Pre-existing):

  • ❌ Bun tests fail (bun not installed in CI environment)
  • This affects tests/stale_index_query.test.ts but is not introduced by this PR

Architecture Review

Strengths:

  1. Well-structured regression harness: Separate tests/regression/ package with clear fixture helpers
  2. Dual validation approach: Both statistical (Deepchecks) and deterministic (FTS5 shell) checks
  3. Proper test isolation: live marker correctly excludes tests needing production DB
  4. Script integration: run_tests.sh now auto-discovers regression shell scripts
  5. Fixture design: stale_index_query.json serves both Python and TypeScript tests

Concerns:

  1. Sample size limitation: 5 samples is minimal for statistical drift detection - consider documenting this limitation
  2. Dependency pins: numpy<2, scikit-learn<1.6 may conflict with future updates
  3. Missing bun in CI: The TypeScript regression test can't run without bun installed

Recommendations

  1. Document the K-S threshold rationale in a comment near the test or in test documentation
  2. Consider increasing fixture samples to 10-20 for more robust statistical testing
  3. Add bun to CI workflow to enable TypeScript regression tests
  4. Monitor dependency pins - the setuptools<81 and version constraints may need adjustment

Verdict

The PR is now ready to merge after the drift threshold fix. The implementation is solid, tests pass (except pre-existing bun issue), and the regression harness will effectively catch embedding model drift and FTS5 ranking changes.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f37b17d5e0

Comment thread pyproject.toml
Comment on lines +81 to +85
"numpy<2",
"pytest>=7.0.0",
"pytest-asyncio>=0.21.0",
"ruff>=0.1.0",
"scikit-learn<1.6",

P1: Relax dev pins that block Python 3.13 test jobs

CI still runs Python 3.13 in .github/workflows/ci.yml, but this dev extra now pins numpy<2 and scikit-learn<1.6, which excludes the first releases that officially support 3.13. In the 3.13 lane, pip install -e ".[dev]" will then fall back to unsupported/source-build paths (or fail resolution), so the workflow can fail before tests even start. Please make these pins conditional by Python version (or lift them on 3.13) to keep the matrix installable.

Comment thread scripts/run_tests.sh
Comment on lines +123 to +125
if [ "${#shell_tests[@]}" -gt 0 ]; then
for shell_test in "${shell_tests[@]}"; do
run_step "regression shell $(basename "$shell_test")" bash "$shell_test"

P2: Respect BRAINLAYER_USE_UV in regression shell execution

This loop runs every regression shell script unconditionally, but the newly added tests/regression/test_fts5_determinism.sh requires uv/uvx. That means BRAINLAYER_USE_UV=0 no longer provides a UV-free test run: environments that disable or lack UV tooling will still fail in this lane. The regression shell step should either honor BRAINLAYER_USE_UV or explicitly preflight/skip when UV tools are unavailable.
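
One way to express the suggested preflight, sketched in Python for clarity (the real gate would sit in the shell script). The function name is hypothetical, and treating an unset BRAINLAYER_USE_UV as enabled is an assumption, since the thread does not state the default.

```python
import os
import shutil


def uv_shell_tests_enabled(env=None, which=shutil.which):
    """Decide whether uv-dependent regression shell scripts should run.

    Returns False when the caller opted out via BRAINLAYER_USE_UV=0 or
    when either the uv or uvx executable cannot be found on PATH.
    """
    env = os.environ if env is None else env
    # Assumption: an unset variable means uv is allowed.
    if env.get("BRAINLAYER_USE_UV", "1") == "0":
        return False
    return which("uv") is not None and which("uvx") is not None
```

A runner could call this once and print a skip message instead of failing when it returns False.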

The K-S test with n=5 samples has a minimum detectable difference of 0.2
due to discrete CDF steps. The embeddings are identical (max diff 2.38e-07)
and pass the cosine similarity threshold (>0.999), so the 0.001 threshold
was overly strict. Adjusted to 0.25 to match K-S behavior at this sample size.

Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>

temp_path = Path(temp_file.name)
temp_file.close()
create_fixture_db(temp_path)
return temp_path

Unused helper exported from regression fixture

Low Severity

The new create_temp_fixture_db helper has no callers anywhere in the repository. The Python regression test uses create_fixture_db directly with a known path, and the shell test invokes create_fixture_db plus write_expected_ranking_json from inline Python. Because delete=False is set on the underlying NamedTemporaryFile, any future caller would also need an explicit cleanup path that does not currently exist.
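
If the helper were kept, the missing cleanup path could be provided by a context manager along these lines. This is a sketch under assumptions: `temp_fixture_db` and its `build` parameter are hypothetical names, with `build` standing in for a callable such as create_fixture_db that writes a database to the given path.

```python
import contextlib
import tempfile
from pathlib import Path


@contextlib.contextmanager
def temp_fixture_db(build):
    """Create a temporary database file, yield its path, and always remove it."""
    temp_file = tempfile.NamedTemporaryFile(suffix=".db", delete=False)
    temp_path = Path(temp_file.name)
    # Close the handle first so the builder can reopen the path on any platform.
    temp_file.close()
    try:
        build(temp_path)
        yield temp_path
    finally:
        temp_path.unlink(missing_ok=True)
```

Callers would then write `with temp_fixture_db(create_fixture_db) as path: ...` and never handle deletion themselves.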

Reviewed by Cursor Bugbot for commit 8160680.

Comment thread pyproject.toml
]
dev = [
"deepchecks>=0.19.1",
"numpy<2",

numpy<2 pin breaks Python 3.13 install

High Severity

The new dev extra constraint numpy<2 is incompatible with Python 3.13, which is included in the CI matrix. NumPy 1.x (latest 1.26.x) only ships wheels for Python 3.9–3.12 and refuses to build from source on 3.13, so pip install -e ".[dev]" will fail on the 3.13 job before any tests run. The runtime dependency numpy>=1.22,<3.0 previously allowed NumPy 2.x, which does support 3.13.

Reviewed by Cursor Bugbot for commit 8160680.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit c9b8c58.

with_display=False,
)

assert result.passed_conditions()

Drift test loads heavy model in CI default lane

Medium Severity

test_fixture_embeddings_pass_deepchecks_and_cosine_threshold calls current_embedding_rows(), which triggers EmbeddingModel._load_model() and downloads the full BAAI/bge-large-en-v1.5 weights (~1.3 GB) from HuggingFace, then runs inference on CPU. The test carries no slow, live, or integration marker, so the new CI "Unit tests" step runs it on every push/PR across three Python versions, materially increasing CI runtime and requiring outbound network access. The project's own slow marker (defined in pyproject.toml) explicitly targets "tests that load ML models", which this test does.

Reviewed by Cursor Bugbot for commit c9b8c58.

with_display=False,
)

assert result.passed_conditions()

KS drift threshold near floor risks flaky failures

Medium Severity

With only 5 rows per dataset, the KS statistic for any feature can take values 0, 0.2, 0.4, 0.6, 0.8, or 1.0. Setting max_allowed_numeric_score=0.21 allows the 0.2 floor but fails for 0.4. Across 1024 numeric features, the condition requires every feature to stay at 0 or 0.2 — even minor float-level differences caused by CPU vs. MPS, BLAS variants, or sentence-transformers version drift can push several dimensions to 0.4, causing the condition to fail despite a near-perfect cosine similarity. This makes the gate fragile in CI environments that differ from the fixture-generation host.
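
The fragility described here can be reproduced with a toy per-feature gate. This sketch assumes the condition fails as soon as any single feature's K-S statistic exceeds the limit; the helper names and the two-feature data are hypothetical. Feature 0 has well-separated values, so a 1e-7 shift only reaches about 0.2. Feature 1 contains two values closer together than the shift, which lets two baseline points sort ahead of any shifted point and lifts the statistic to 0.4.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (floating point)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(
            bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted)
        )
        for x in a_sorted + b_sorted
    )


def features_over_limit(train_rows, test_rows, limit=0.21):
    """Return column indices whose K-S statistic exceeds the limit."""
    train_cols = zip(*train_rows)
    test_cols = zip(*test_rows)
    return [
        j
        for j, (train, test) in enumerate(zip(train_cols, test_cols))
        if ks_statistic(train, test) > limit
    ]


# Hypothetical 5-row, 2-feature fixture; feature 1 has a near-tie in rows 0 and 1.
train_rows = [
    (0.1, 0.1),
    (0.2, 0.10000005),
    (0.3, 0.3),
    (0.4, 0.4),
    (0.5, 0.5),
]
test_rows = [(x + 1e-7, y + 1e-7) for x, y in train_rows]
offenders = features_over_limit(train_rows, test_rows)
```

With 1024 dimensions, a single near-tie of this kind in any dimension would be enough to fail the gate under this assumption.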

Reviewed by Cursor Bugbot for commit c9b8c58.

@EtanHey EtanHey merged commit b68de99 into main Apr 27, 2026
7 checks passed
@EtanHey EtanHey deleted the feat/p5a-deepchecks branch April 27, 2026 23:14

2 participants