Skip to content

fix(rate-limiter): production hardening — TLS support for managed Redis#74

Open
gandhipratik203 wants to merge 6 commits intomainfrom
fix/rate-limiter-tls-support-and-wipe-on-disable
Open

fix(rate-limiter): production hardening — TLS support for managed Redis#74
gandhipratik203 wants to merge 6 commits intomainfrom
fix/rate-limiter-tls-support-and-wipe-on-disable

Conversation

@gandhipratik203
Copy link
Copy Markdown
Collaborator

@gandhipratik203 gandhipratik203 commented May 1, 2026

Summary

Enables TLS for the rate limiter's Redis backend so deployments against managed Redis services with in-transit encryption (AWS ElastiCache, Redis Cloud, etc.) work end-to-end. Bumps cpex_rate_limiter 0.0.4 → 0.0.6.


Gaps closed

Gap A (HIGH, wo-tracker #68217) — The Redis backend cannot connect to managed Redis services that require TLS. Operators using AWS ElastiCache with in-transit encryption (or any other rediss:// URL) can configure the binding successfully — the binding API returns HTTP 200 — but the plugin's backend init raises InvalidClientConfig: can't connect with TLS, the feature is not enabled. The plugin manager logs Failed to load plugin RateLimiterPlugin and silently skips it under the default fail_on_plugin_error: false setting, leaving rate limits unenforced while the operator believes them to be in place. Fixed by compiling the redis crate with rustls-based TLS features.

Gap B (MEDIUM) — The redis crate version previously pinned (0.27) carried rustls-pemfile 2.2.0 transitively, which cargo-deny flagged as unmaintained (RUSTSEC-2025-0134) and blocked CI's security-policy stage. Fixed by bumping the redis crate to 0.32, where the rustls-native-certs upgrade dropped the rustls-pemfile dep entirely.

Gap C (MEDIUM) — rustls 0.23 dropped its implicit default crypto provider; the redis crate's tls-rustls feature does not pick one. Without an explicit install, the first TLS handshake panicked with Call CryptoProvider::install_default() before this point.... The Python shim's broad except caught the panic and fail_mode=open allowed the request through, so deployments would still appear to load the plugin while quietly enforcing nothing — same observable end state as Gap A, different log message. Fixed by installing the ring crypto provider once at plugin init via a OnceLock-guarded call in RateLimiterPluginCore::new().


Test results

Full local gate
cargo fmt -- --check                  clean
cargo clippy -- -D warnings           clean
cargo test --lib                      57 passed
uv run pytest tests/rate_limiter/     107 passed
New tests added in this PR

TestRedisTlsSupport — regression coverage for wo-tracker #68217

Test Target
test_rediss_url_does_not_fail_with_tls_not_enabled Constructs the plugin with a rediss:// URL; asserts no InvalidClientConfig is raised at backend init. Pins Gap A.
test_rediss_url_lazy_handshake_reaches_connectivity_layer Drives a tool_pre_invoke against a closed rediss:// port so the lazy TLS handshake actually engages; asserts the captured warning does not contain the wo-tracker #68217 signature (feature is not enabled / InvalidClientConfig). Pins Gap A and Gap C jointly.

Migration notes

  • No new required config. Existing deployments using redis:// URLs continue to work unchanged. rediss:// URLs now work where they previously failed.
  • CA bundle expectation. The plugin reads CA roots from the host OS trust store at runtime via rustls-native-certs. Container images that omit ca-certificates (some distroless / scratch / unconfigured Alpine variants) would need to install the bundle before TLS endpoints can be verified. UBI Minimal — what mcp-context-forge ships on — already includes the CA bundle by default; AWS ElastiCache uses public-CA chains that are present there.

Follow-ups

  • Tighten the TLS lazy-handshake test's positive assertion (currently scoped to the wo-tracker #68217 negative signature only). Would require the Rust core's log_exception() to surface the underlying RedisError text instead of the generic "error; allowing request" message — the error details are dropped at the Python boundary today.
  • Real-TLS-Redis fixture for end-to-end handshake verification (self-signed cert + CA-pinning fixture). Heavier than the connectivity-shape variant; deferred until needed.
  • Companion main-repo PR to bump the cpex-rate-limiter pin to 0.0.5 once published to PyPI.
  • Wipe-on-disable feature (counter cleanup when an operator transitions the plugin to mode=disabled) and the round-1/round-2 review feedback addressing it have been moved to a separate branch (feat/rate-limiter-wipe-on-disable-only); a follow-up PR will be linked here once opened.

@gandhipratik203 gandhipratik203 self-assigned this May 1, 2026
@gandhipratik203 gandhipratik203 added the enhancement New feature or request label May 1, 2026
Copy link
Copy Markdown
Collaborator

@lucarlig lucarlig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detailed review pass found no Critical issues. I left five P2 findings covering Redis wipe safety/performance, lifecycle coupling, and test coverage gaps.

Comment thread plugins/rust/python-package/rate_limiter/cpex_rate_limiter/rate_limiter.py Outdated
Comment thread plugins/rust/python-package/rate_limiter/cpex_rate_limiter/rate_limiter.py Outdated
Comment thread plugins/tests/rate_limiter/test_redis_integration.py
Comment thread plugins/tests/rate_limiter/test_redis_integration.py Outdated
@msureshkumar88
Copy link
Copy Markdown
Collaborator

Code Review — PR #74: production hardening (TLS + wipe-on-disable)

Overview

Closes two real gaps in the Redis path:

  • Gap A (HIGH): rediss:// URLs blow up at plugin init because the redis crate was compiled without TLS features. Fixed by enabling tls-rustls* + tokio-rustls-comp.
  • Gap B (MEDIUM): Counter state survives a disable→re-enable cycle, causing unexpected immediate throttling. Fixed by reading plugin:<name>:mode from Redis at shutdown and wiping <prefix>:* when the value is "disabled".

Architecture is sound and the wipe contract (conditional, fail-safe, idempotent, non-blocking on core shutdown) is correctly implemented.


Blocking Issues

1. security-policy CI failure — unmaintained rustls-pemfile (RUSTSEC-2025-0134)

cargo-deny rejects the build because tls-rustls pulls in rustls-native-certsrustls-pemfile, which was archived in August 2025 with no safe upgrade path.

The tls-rustls feature is redundant here. The PR's stated goal ("managed services with public-CA certificates work without further configuration") is fully satisfied by tls-rustls-webpki-roots alone — it bundles Mozilla CA roots and doesn't touch rustls-native-certs.

Fix: drop tls-rustls from Cargo.toml:

# before
redis = { version = "0.27", features = ["aio", "tokio-comp", "tls-rustls", "tls-rustls-webpki-roots", "tokio-rustls-comp"] }

# after — removes rustls-native-certs → rustls-pemfile from the dep tree
redis = { version = "0.27", features = ["aio", "tokio-comp", "tls-rustls-webpki-roots", "tokio-rustls-comp"] }

2. mutation-testing (rate_limiter) CI failure — stale diff baseline

cargo-mutants errors with:

diff has:   "        Python::attach( |py| -> PyResult<()> {"
source has: "        // Ensure the embedded interpreter is initialized for this test process."
The diff might be out of date with this source tree.

The action computes the diff between BASE_SHA and HEAD_SHA. If the branch was rebased after the run started, or if BASE_SHA points to a commit whose tree no longer matches the working tree, cargo-mutants rejects the diff. A fresh push after the rustls-pemfile fix above should regenerate a consistent baseline. If it persists, verify BASE_SHA in the CI workflow resolves to the correct merge-base.

(mutation-testing (secrets_detection) failure appears unrelated to this PR — no secrets_detection files changed.)


Non-Blocking Observations

Python connection overhead on shutdown

_mode_in_redis_says_disabled() opens a brand-new redis.asyncio client, issues one GET, then closes it. The Rust core holds an already-open connection, but it's not reachable from Python at shutdown time. This is fine in practice (shutdown is rare), but worth a brief inline comment explaining why a new client is opened instead of reusing the Rust one.

Type annotation

batch: list = []

Should be list[str | bytes] (or whatever aioredis.scan_iter yields) to keep the type checker happy and signal intent.

Deferred import

try:
    import redis.asyncio as aioredis  # noqa: PLC0415
except Exception:
    return False

redis is now a declared dep in pyproject.toml, so the ImportError guard is dead code. A top-level import is cleaner. The PLC0415 suppression hint can also be dropped.

Batch size magic number

The 500-key batch in _wipe_my_counters() is reasonable but uncommented. Worth a one-liner: # 500 keys per DEL keeps round-trip count low without risking a single oversized command.


Test Coverage

Good. The six TestWipeOnDisable cases cover the full conditional matrix cleanly, and the concurrent-shutdown test is exactly the right thing to have. One gap worth considering in a follow-up: a test that asserts SCAN is used instead of KEYS (mock the redis client and verify scan_iter was called, not keys). Not a blocker for this PR since the implementation visibly uses scan_iter, but it would prevent regression if someone refactors the wipe path.


Summary

Two blockers before merge:

  1. Remove tls-rustls feature to eliminate the rustls-pemfile advisory and unblock security-policy.
  2. Confirm mutation-testing passes after a fresh push (likely resolves itself once (1) is fixed and a new run is triggered with a fresh diff baseline).

Code quality is good. The wipe logic is correct, the fail-safe properties are well-reasoned, and the README/migration notes are clear. Approve pending the two CI fixes.

gandhipratik203 added a commit that referenced this pull request May 1, 2026
…efix + TLS test depth)

Addresses four P2 findings from Luca's review on PR #74. P2-3 (lifecycle
coupling — move wipe into the Rust core and add an on_mode_change hook
to the framework) is deliberately deferred as a separate follow-up; it
requires framework-level changes in mcp-context-forge that are out of
scope for this PR.

P2-1 — single-flight guard around the wipe.
  Without coordination, every worker that sees mode=disabled runs the
  full SCAN+DEL independently. With 3 replicas × 24 workers = 72
  workers all wiping the same keyspace, that's ~72× the Redis work
  and contention on the same keys. Added _acquire_wipe_lock() /
  _release_wipe_lock() using SET ... NX EX 30 — only the worker that
  wins the lock performs the wipe; the rest skip cleanly.
  Lock key uses a dash separator (`<prefix>-wipe-lock`) so the wipe
  SCAN's `<prefix>:*` pattern doesn't match it. TTL bounds starvation
  if the lock holder dies mid-wipe; the lock is released eagerly after
  a successful wipe so the next disable cycle is responsive.

P2-2 — UNLINK with DEL fallback.
  SCAN avoids blocking key enumeration, but DEL still frees values on
  Redis' main thread. For sliding-window sorted sets, dropping 500 of
  those at once can stutter Redis. Switched to UNLINK (Redis 4.0+,
  defers value-free to a background thread) with a ResponseError
  fallback to DEL for pre-4.0 deployments.

P2-4 — TLS handshake-depth test.
  The existing rediss:// regression test only verifies URL parsing.
  Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that
  drives a tool_pre_invoke and asserts the failure logs are
  connectivity-shaped (refused / IO / timed out), explicitly NOT
  "feature is not enabled" / InvalidClientConfig — the latter being
  the wo-tracker #68217 regression signature. Catches a future Cargo
  feature-flag mistake or a rustls regression that the construction
  test would silently pass.

P2-5 — custom redis_key_prefix coverage.
  Every existing test uses the default prefix, so a regression that
  hard-coded "rl:*" would still pass. Added
  test_wipe_only_deletes_keys_under_configured_prefix that seeds keys
  under a custom prefix and an unrelated namespace, then asserts only
  the configured-prefix keys are deleted.

Local gate: cargo fmt-check + clippy + 57 Rust tests; pytest 115/115
(112 prior + lock-skipped + custom-prefix + TLS-handshake-depth = 115).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Copy link
Copy Markdown
Collaborator

@lucarlig lucarlig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up detailed review pass on the latest head found three remaining P2 findings.

Comment thread plugins/rust/python-package/rate_limiter/cpex_rate_limiter/rate_limiter.py Outdated
Comment thread plugins/rust/python-package/rate_limiter/cpex_rate_limiter/rate_limiter.py Outdated
Comment thread plugins/tests/rate_limiter/test_redis_integration.py
gandhipratik203 added a commit that referenced this pull request May 1, 2026
…efix + TLS test depth)

Addresses four P2 findings from Luca's review on PR #74. P2-3 (lifecycle
coupling — move wipe into the Rust core and add an on_mode_change hook
to the framework) is deliberately deferred as a separate follow-up; it
requires framework-level changes in mcp-context-forge that are out of
scope for this PR.

P2-1 — single-flight guard around the wipe.
  Without coordination, every worker that sees mode=disabled runs the
  full SCAN+DEL independently. With 3 replicas × 24 workers = 72
  workers all wiping the same keyspace, that's ~72× the Redis work
  and contention on the same keys. Added _acquire_wipe_lock() /
  _release_wipe_lock() using SET ... NX EX 30 — only the worker that
  wins the lock performs the wipe; the rest skip cleanly.
  Lock key uses a dash separator (`<prefix>-wipe-lock`) so the wipe
  SCAN's `<prefix>:*` pattern doesn't match it. TTL bounds starvation
  if the lock holder dies mid-wipe; the lock is released eagerly after
  a successful wipe so the next disable cycle is responsive.

P2-2 — UNLINK with DEL fallback.
  SCAN avoids blocking key enumeration, but DEL still frees values on
  Redis' main thread. For sliding-window sorted sets, dropping 500 of
  those at once can stutter Redis. Switched to UNLINK (Redis 4.0+,
  defers value-free to a background thread) with a ResponseError
  fallback to DEL for pre-4.0 deployments.

P2-4 — TLS handshake-depth test.
  The existing rediss:// regression test only verifies URL parsing.
  Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that
  drives a tool_pre_invoke and asserts the failure logs are
  connectivity-shaped (refused / IO / timed out), explicitly NOT
  "feature is not enabled" / InvalidClientConfig — the latter being
  the wo-tracker #68217 regression signature. Catches a future Cargo
  feature-flag mistake or a rustls regression that the construction
  test would silently pass.

P2-5 — custom redis_key_prefix coverage.
  Every existing test uses the default prefix, so a regression that
  hard-coded "rl:*" would still pass. Added
  test_wipe_only_deletes_keys_under_configured_prefix that seeds keys
  under a custom prefix and an unrelated namespace, then asserts only
  the configured-prefix keys are deleted.

Local gate: cargo fmt-check + clippy + 57 Rust tests; pytest 115/115
(112 prior + lock-skipped + custom-prefix + TLS-handshake-depth = 115).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/rate-limiter-tls-support-and-wipe-on-disable branch from 2b453b6 to a477af2 Compare May 1, 2026 13:53
Comment thread deny.toml Outdated
gandhipratik203 added a commit that referenced this pull request May 1, 2026
Addresses Luca's P2-4 review feedback on PR #74: the construction-only
TLS regression test verifies URL parsing but not the lazy handshake.
Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that
drives a tool_pre_invoke against rediss://127.0.0.1:1/15 so the lazy
TLS path actually engages, then asserts the failure is connectivity-
shaped (refused / timed out / IO) rather than the wo-tracker #68217
'feature is not enabled' / InvalidClientConfig signature.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/rate-limiter-tls-support-and-wipe-on-disable branch from a477af2 to f5d306c Compare May 1, 2026 14:27
@gandhipratik203 gandhipratik203 changed the title fix(rate-limiter): production hardening — TLS support and counter cleanup on disable fix(rate-limiter): production hardening — TLS support for managed Redis May 1, 2026
The redis crate was compiled without any TLS feature, so passing a
rediss:// URL (e.g. AWS ElastiCache with in-transit encryption, Redis
Cloud) failed at backend init with `InvalidClientConfig: can't connect
with TLS, the feature is not enabled`. The plugin manager then logged
"Failed to load plugin RateLimiterPlugin" and silently skipped the
plugin — bindings appeared configured but no rate limiting was applied.

Compile the redis dep with rustls-based TLS (pure Rust, no OpenSSL):
  - tls-rustls
  - tls-rustls-webpki-roots  (Mozilla CA roots, needed for managed
                              Redis services without custom trust setup)
  - tokio-rustls-comp        (async support over rustls)

Added a regression test (TestRedisTlsSupport) that constructs a plugin
with a rediss:// URL and asserts no InvalidClientConfig is raised.

Bumps cpex_rate_limiter 0.0.4 → 0.0.5.

Refs: wo-tracker #68217
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
… dep

cargo-deny security-policy CI was red on this branch with two findings:

  1. RUSTSEC-2025-0134 — rustls-pemfile 2.2.0 is unmaintained.
     Pulled in by redis 0.27 → rustls-native-certs 0.7.3.

  2. License rejected — webpki-roots is CDLA-Permissive-2.0,
     which is not in the workspace deny.toml allow-list.

Both come from the TLS feature additions in the previous commit.

Bumping the redis crate to 0.32 takes us to rustls-native-certs 0.8.x,
which migrated to rustls-pki-types and dropped the rustls-pemfile
dependency entirely. The advisory is closed at the root rather than
suppressed via deny.toml.

The license rejection is fixed by adding CDLA-Permissive-2.0 to the
workspace allow-list. CDLA-Permissive-2.0 is a Linux-Foundation
permissive data license used by webpki-roots for the Mozilla CA
bundle — keeping the bundle pinned in the binary keeps deployments
self-contained (no host-OS CA-store dependency in containers).

No source-code changes — redis 0.27 → 0.32 is API-compatible for the
APIs the Rust core uses (Client::open, MultiplexedConnection, cmd
builder, redis::Value pattern matching with _ arms, ErrorKind variants
we touch). All 47 redis call sites in redis_backend.rs compile
unchanged. Local gate green: cargo check-all (57 Rust tests),
pytest (112 Python tests).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Addresses Luca's P2-4 review feedback on PR #74: the construction-only
TLS regression test verifies URL parsing but not the lazy handshake.
Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that
drives a tool_pre_invoke against rediss://127.0.0.1:1/15 so the lazy
TLS path actually engages, then asserts the failure is connectivity-
shaped (refused / timed out / IO) rather than the wo-tracker #68217
'feature is not enabled' / InvalidClientConfig signature.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…me path

Two related changes that together make `rediss://` URLs actually work
end-to-end:

  1. Install the rustls ring crypto provider once at Rust core init.

     Previous TLS commits made `rediss://` URLs parse and the plugin
     load without the wo-tracker #68217 "feature is not enabled" error,
     but they left a runtime-time gap: rustls 0.23 dropped its implicit
     default crypto provider, so the first real TLS handshake panicked
     with `Call CryptoProvider::install_default()...`.

     Added a direct dep on rustls 0.23 with the `ring` feature, plus an
     OnceLock-guarded `ensure_crypto_provider()` in the Rust core that
     calls `rustls::crypto::ring::default_provider().install_default()`
     exactly once per process. Called from `RateLimiterPluginCore::new()`
     before any redis client is constructed.

     `ring` is the simpler default — pure Rust + small asm, BSD-3 / ISC
     licensed, builds cleanly across the existing PyPI wheel matrix
     without extra system toolchains. aws-lc-rs would be needed for
     FIPS validation; not a requirement for this customer.

  2. Drop the `tls-rustls-webpki-roots` feature on the redis crate.

     That feature pulled `webpki-roots` (Mozilla's CA bundle, licensed
     CDLA-Permissive-2.0) into the dep tree, which the workspace
     deny.toml correctly rejected because that license is not in the
     allow list. Adding the license to the allow list (or as an
     exception) widens the workspace policy for one transitive crate.

     Without `tls-rustls-webpki-roots`, the redis crate uses
     `rustls-native-certs` (Apache-2.0 / MIT / ISC — already allowed)
     to read CA certificates from the host OS trust store at runtime.
     mcp-context-forge ships on UBI 10 Minimal, which includes
     `ca-certificates` by default; AWS ElastiCache uses public CA
     chains that are in any standard OS bundle. Operationally
     equivalent for this deployment, with a clean license posture.

     Reverts the `CDLA-Permissive-2.0` entry from the workspace
     deny.toml allow list — no longer needed.

Test: the existing TLS lazy-handshake regression test
(test_rediss_url_lazy_handshake_reaches_connectivity_layer) now reaches
the network layer instead of panicking at TLS init. Its assertion is
scoped to the wo-tracker #68217 negative signature ("feature is not
enabled" / InvalidClientConfig must never appear). A positive "the lazy
handshake actually reached the network" assertion would require the
Rust core's log_exception() to surface the underlying RedisError text
instead of the generic "error; allowing request" message — tracked as
a follow-up alongside the real-TLS-Redis-fixture variant.

Local gate: cargo fmt-check + clippy clean, 57 Rust unit tests pass,
107 Python integration tests pass.

Cargo.lock confirms removal of unmaintained `rustls-pemfile` and
license-flagged `webpki-roots`; `rustls-native-certs` 0.8.3 is the only
remaining TLS-related dep.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Main bumped rate_limiter to 0.0.5 in #73 as part of a workspace-wide
dependency refresh, so this PR's release slot moves to 0.0.6. The
content of 0.0.6 is the TLS / rediss:// support work in this PR
(crypto provider install, redis crate bump for advisory cleanup,
TLS regression tests).

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/rate-limiter-tls-support-and-wipe-on-disable branch from bdc0740 to 1969f17 Compare May 1, 2026 15:23
…unit test

cargo-mutants flagged an uncovered mutation: replacing
ensure_crypto_provider's body with () passed the existing test suite,
because nothing in the Rust unit tests directly exercises the crypto
provider install (TLS coverage lives on the Python integration side).

Added a unit test in plugin.rs's tests module that calls
ensure_crypto_provider() and asserts
rustls::crypto::CryptoProvider::get_default().is_some().

  - Real implementation installs the ring provider via OnceLock
    → get_default() returns Some → test passes.
  - Mutated implementation does nothing → no provider installed
    → get_default() returns None → test fails → mutant killed.

OnceLock makes the call idempotent, so the test is order-independent
within a cargo-test process. No behavioral change; pure coverage.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 requested a review from lucarlig May 1, 2026 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants