Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (8)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
|
@codex review |
✅ Actions performedReview triggered.
|
|
The PR already exists and doesn't need updating since I didn't change the title or description. Now let me create a comprehensive review summary: Code Review SummaryI've completed a thorough review of PR #259 "test: add Deepchecks regression harness" and fixed a critical test failure. Issues Found and FixedCritical: Test Failure in Deepchecks Drift Detection
Test ResultsPassing Tests:
Environmental Issue (Pre-existing):
Architecture ReviewStrengths:
Concerns:
Recommendations
VerdictThe PR is now ready to merge after the drift threshold fix. The implementation is solid, tests pass (except pre-existing bun issue), and the regression harness will effectively catch embedding model drift and FTS5 ranking changes. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f37b17d5e0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "numpy<2", | ||
| "pytest>=7.0.0", | ||
| "pytest-asyncio>=0.21.0", | ||
| "ruff>=0.1.0", | ||
| "scikit-learn<1.6", |
There was a problem hiding this comment.
Relax dev pins that block Python 3.13 test jobs
CI still runs Python 3.13 in .github/workflows/ci.yml, but this dev extra now pins numpy<2 and scikit-learn<1.6, which excludes the first releases that officially support 3.13. In the 3.13 lane, pip install -e ".[dev]" will then fall back to unsupported/source-build paths (or fail resolution), so the workflow can fail before tests even start. Please make these pins conditional by Python version (or lift them on 3.13) to keep the matrix installable.
Useful? React with 👍 / 👎.
| if [ "${#shell_tests[@]}" -gt 0 ]; then | ||
| for shell_test in "${shell_tests[@]}"; do | ||
| run_step "regression shell $(basename "$shell_test")" bash "$shell_test" |
There was a problem hiding this comment.
Respect BRAINLAYER_USE_UV in regression shell execution
This loop runs every regression shell script unconditionally, but the newly added tests/regression/test_fts5_determinism.sh requires uv/uvx. That means BRAINLAYER_USE_UV=0 no longer provides a UV-free test run: environments that disable or lack UV tooling will still fail in this lane. The regression shell step should either honor BRAINLAYER_USE_UV or explicitly preflight/skip when UV tools are unavailable.
Useful? React with 👍 / 👎.
The K-S test with n=5 samples has a minimum detectable difference of 0.2 due to discrete CDF steps. The embeddings are identical (max diff 2.38e-07) and pass the cosine similarity threshold (>0.999), so the 0.001 threshold was overly strict. Adjusted to 0.25 to match K-S behavior at this sample size. Co-authored-by: Etan Heyman <EtanHey@users.noreply.github.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
| temp_path = Path(temp_file.name) | ||
| temp_file.close() | ||
| create_fixture_db(temp_path) | ||
| return temp_path |
There was a problem hiding this comment.
Unused helper exported from regression fixture
Low Severity
The new create_temp_fixture_db helper has no callers anywhere in the repository. The Python regression test uses create_fixture_db directly with a known path, and the shell test invokes create_fixture_db plus write_expected_ranking_json from inline Python. Because delete=False is set on the underlying NamedTemporaryFile, any future caller would also need an explicit cleanup path that does not currently exist.
Reviewed by Cursor Bugbot for commit 8160680. Configure here.
| ] | ||
| dev = [ | ||
| "deepchecks>=0.19.1", | ||
| "numpy<2", |
There was a problem hiding this comment.
numpy<2 pin breaks Python 3.13 install
High Severity
The new dev extra constraint numpy<2 is incompatible with Python 3.13, which is included in the CI matrix. NumPy 1.x (latest 1.26.x) only ships wheels for Python 3.9–3.12 and refuses to build from source on 3.13, so pip install -e ".[dev]" will fail on the 3.13 job before any tests run. The runtime dependency numpy>=1.22,<3.0 previously allowed NumPy 2.x, which does support 3.13.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8160680. Configure here.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c9b8c58. Configure here.
| with_display=False, | ||
| ) | ||
|
|
||
| assert result.passed_conditions() |
There was a problem hiding this comment.
Drift test loads heavy model in CI default lane
Medium Severity
test_fixture_embeddings_pass_deepchecks_and_cosine_threshold calls current_embedding_rows(), which triggers EmbeddingModel._load_model() and downloads the full BAAI/bge-large-en-v1.5 weights (~1.3 GB) from HuggingFace, then runs inference on CPU. The test carries no slow, live, or integration marker, so the new CI "Unit tests" step runs it on every push/PR across three Python versions, materially increasing CI runtime and requiring outbound network access. The project's own slow marker (defined in pyproject.toml) explicitly targets "tests that load ML models", which this test does.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit c9b8c58. Configure here.
| with_display=False, | ||
| ) | ||
|
|
||
| assert result.passed_conditions() |
There was a problem hiding this comment.
KS drift threshold near floor risks flaky failures
Medium Severity
With only 5 rows per dataset, the KS statistic for any feature can take values 0, 0.2, 0.4, 0.6, 0.8, or 1.0. Setting max_allowed_numeric_score=0.21 allows the 0.2 floor but fails for 0.4. Across 1024 numeric features, the condition requires every feature to stay at 0 or 0.2 — even minor float-level differences caused by CPU vs. MPS, BLAS variants, or sentence-transformers version drift can push several dimensions to 0.4, causing the condition to fail despite a near-perfect cosine similarity. This makes the gate fragile in CI environments that differ from the fixture-generation host.
Reviewed by Cursor Bugbot for commit c9b8c58. Configure here.




Summary
tests/fixtures/stale_index_query.jsonto validate embedding cosine similarity against the baseline fixturesqlite-utils query+diff, and wire it intoscripts/run_tests.shlivetests from the default lane and isolating the Ranx + hook-routing suites that only pass reliably in separate pytest processesTest plan
uv run --extra dev ruff check src/ tests/python3 - <<'PY'YAML parse of.github/workflows/ci.ymluv run --extra dev pytest tests/test_run_tests_script.py tests/regression/test_drift_detection.py -qbash tests/regression/test_fts5_determinism.shBRAINLAYER_USE_UV=1 bash scripts/run_tests.shNote
Medium Risk
Moderate risk because it changes CI/test execution behavior (new
livemarker, splitting suites, new shell-based gates) and adds heavier dev dependencies (deepchecks, pinnednumpy/scikit-learn) that can introduce version conflicts or CI flakiness.Overview
Adds a new
tests/regressionharness that (1) re-embeds a seeded fixture and asserts cosine similarity plus DeepchecksFeatureDriftthresholds, and (2) runs a shell-based SQLite FTS5 snapshot query and diffs the normalized results for determinism.Updates
scripts/run_tests.shand GitHub Actions CI to excludelive-marked tests by default, run a few flaky eval/hook-routing tests in separate pytest invocations, run regression*.shscripts automatically, and ensureuvruns pytest with thedevextra. Also extendspyproject.tomlwith the newlivepytest marker and adds dev dependency pins (deepchecks,numpy<2,scikit-learn<1.6,setuptools).Reviewed by Cursor Bugbot for commit c9b8c58. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Add Deepchecks regression harness for embedding drift detection
tests/regression/that checks embedding model drift using DeepchecksFeatureDrift(drift score < 0.21) and cosine similarity thresholds against a seeded fixture.test_fts5_determinism.sh) that seeds a SQLite FTS5 database from the fixture and diffs query results against expected JSON.scripts/run_tests.shto discover and run regression shell tests, run isolated pytest files separately, and support a configurable mark expression (defaulting tonot integration and not live).deepchecks>=0.19.1,numpy<2,scikit-learn<1.6, and constrainedsetuptoolsto thedevextra inpyproject.toml, and registers a newlivepytest marker.live-marked tests from the main unit suite and run eval/hook routing tests in a separate isolated step.Macroscope summarized c9b8c58.