fix(rate-limiter): production hardening — TLS support for managed Redis#74
fix(rate-limiter): production hardening — TLS support for managed Redis#74gandhipratik203 wants to merge 6 commits intomainfrom
Conversation
lucarlig
left a comment
There was a problem hiding this comment.
Detailed review pass found no Critical issues. I left five P2 findings covering Redis wipe safety/performance, lifecycle coupling, and test coverage gaps.
Code Review — PR #74: production hardening (TLS + wipe-on-disable)OverviewCloses two real gaps in the Redis path:
Architecture is sound and the wipe contract (conditional, fail-safe, idempotent, non-blocking on core shutdown) is correctly implemented. Blocking Issues1.
|
…efix + TLS test depth) Addresses four P2 findings from Luca's review on PR #74. P2-3 (lifecycle coupling — move wipe into the Rust core and add an on_mode_change hook to the framework) is deliberately deferred as a separate follow-up; it requires framework-level changes in mcp-context-forge that are out of scope for this PR. P2-1 — single-flight guard around the wipe. Without coordination, every worker that sees mode=disabled runs the full SCAN+DEL independently. With 3 replicas × 24 workers = 72 workers all wiping the same keyspace, that's ~72× the Redis work and contention on the same keys. Added _acquire_wipe_lock() / _release_wipe_lock() using SET ... NX EX 30 — only the worker that wins the lock performs the wipe; the rest skip cleanly. Lock key uses a dash separator (`<prefix>-wipe-lock`) so the wipe SCAN's `<prefix>:*` pattern doesn't match it. TTL bounds starvation if the lock holder dies mid-wipe; the lock is released eagerly after a successful wipe so the next disable cycle is responsive. P2-2 — UNLINK with DEL fallback. SCAN avoids blocking key enumeration, but DEL still frees values on Redis' main thread. For sliding-window sorted sets, dropping 500 of those at once can stutter Redis. Switched to UNLINK (Redis 4.0+, defers value-free to a background thread) with a ResponseError fallback to DEL for pre-4.0 deployments. P2-4 — TLS handshake-depth test. The existing rediss:// regression test only verifies URL parsing. Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that drives a tool_pre_invoke and asserts the failure logs are connectivity-shaped (refused / IO / timed out), explicitly NOT "feature is not enabled" / InvalidClientConfig — the latter being the wo-tracker #68217 regression signature. Catches a future Cargo feature-flag mistake or a rustls regression that the construction test would silently pass. P2-5 — custom redis_key_prefix coverage. Every existing test uses the default prefix, so a regression that hard-coded "rl:*" would still pass. Added test_wipe_only_deletes_keys_under_configured_prefix that seeds keys under a custom prefix and an unrelated namespace, then asserts only the configured-prefix keys are deleted. Local gate: cargo fmt-check + clippy + 57 Rust tests; pytest 115/115 (112 prior + lock-skipped + custom-prefix + TLS-handshake-depth = 115). Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
lucarlig
left a comment
There was a problem hiding this comment.
Follow-up detailed review pass on the latest head found three remaining P2 findings.
…efix + TLS test depth) Addresses four P2 findings from Luca's review on PR #74. P2-3 (lifecycle coupling — move wipe into the Rust core and add an on_mode_change hook to the framework) is deliberately deferred as a separate follow-up; it requires framework-level changes in mcp-context-forge that are out of scope for this PR. P2-1 — single-flight guard around the wipe. Without coordination, every worker that sees mode=disabled runs the full SCAN+DEL independently. With 3 replicas × 24 workers = 72 workers all wiping the same keyspace, that's ~72× the Redis work and contention on the same keys. Added _acquire_wipe_lock() / _release_wipe_lock() using SET ... NX EX 30 — only the worker that wins the lock performs the wipe; the rest skip cleanly. Lock key uses a dash separator (`<prefix>-wipe-lock`) so the wipe SCAN's `<prefix>:*` pattern doesn't match it. TTL bounds starvation if the lock holder dies mid-wipe; the lock is released eagerly after a successful wipe so the next disable cycle is responsive. P2-2 — UNLINK with DEL fallback. SCAN avoids blocking key enumeration, but DEL still frees values on Redis' main thread. For sliding-window sorted sets, dropping 500 of those at once can stutter Redis. Switched to UNLINK (Redis 4.0+, defers value-free to a background thread) with a ResponseError fallback to DEL for pre-4.0 deployments. P2-4 — TLS handshake-depth test. The existing rediss:// regression test only verifies URL parsing. Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that drives a tool_pre_invoke and asserts the failure logs are connectivity-shaped (refused / IO / timed out), explicitly NOT "feature is not enabled" / InvalidClientConfig — the latter being the wo-tracker #68217 regression signature. Catches a future Cargo feature-flag mistake or a rustls regression that the construction test would silently pass. P2-5 — custom redis_key_prefix coverage. Every existing test uses the default prefix, so a regression that hard-coded "rl:*" would still pass. Added test_wipe_only_deletes_keys_under_configured_prefix that seeds keys under a custom prefix and an unrelated namespace, then asserts only the configured-prefix keys are deleted. Local gate: cargo fmt-check + clippy + 57 Rust tests; pytest 115/115 (112 prior + lock-skipped + custom-prefix + TLS-handshake-depth = 115). Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
2b453b6 to
a477af2
Compare
Addresses Luca's P2-4 review feedback on PR #74: the construction-only TLS regression test verifies URL parsing but not the lazy handshake. Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that drives a tool_pre_invoke against rediss://127.0.0.1:1/15 so the lazy TLS path actually engages, then asserts the failure is connectivity- shaped (refused / timed out / IO) rather than the wo-tracker #68217 'feature is not enabled' / InvalidClientConfig signature. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
a477af2 to
f5d306c
Compare
The redis crate was compiled without any TLS feature, so passing a
rediss:// URL (e.g. AWS ElastiCache with in-transit encryption, Redis
Cloud) failed at backend init with `InvalidClientConfig: can't connect
with TLS, the feature is not enabled`. The plugin manager then logged
"Failed to load plugin RateLimiterPlugin" and silently skipped the
plugin — bindings appeared configured but no rate limiting was applied.
Compile the redis dep with rustls-based TLS (pure Rust, no OpenSSL):
- tls-rustls
- tls-rustls-webpki-roots (Mozilla CA roots, needed for managed
Redis services without custom trust setup)
- tokio-rustls-comp (async support over rustls)
Added a regression test (TestRedisTlsSupport) that constructs a plugin
with a rediss:// URL and asserts no InvalidClientConfig is raised.
Bumps cpex_rate_limiter 0.0.4 → 0.0.5.
Refs: wo-tracker #68217
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
… dep
cargo-deny security-policy CI was red on this branch with two findings:
1. RUSTSEC-2025-0134 — rustls-pemfile 2.2.0 is unmaintained.
Pulled in by redis 0.27 → rustls-native-certs 0.7.3.
2. License rejected — webpki-roots is CDLA-Permissive-2.0,
which is not in the workspace deny.toml allow-list.
Both come from the TLS feature additions in the previous commit.
Bumping the redis crate to 0.32 takes us to rustls-native-certs 0.8.x,
which migrated to rustls-pki-types and dropped the rustls-pemfile
dependency entirely. The advisory is closed at the root rather than
suppressed via deny.toml.
The license rejection is fixed by adding CDLA-Permissive-2.0 to the
workspace allow-list. CDLA-Permissive-2.0 is a Linux-Foundation
permissive data license used by webpki-roots for the Mozilla CA
bundle — keeping the bundle pinned in the binary keeps deployments
self-contained (no host-OS CA-store dependency in containers).
No source-code changes — redis 0.27 → 0.32 is API-compatible for the
APIs the Rust core uses (Client::open, MultiplexedConnection, cmd
builder, redis::Value pattern matching with _ arms, ErrorKind variants
we touch). All 47 redis call sites in redis_backend.rs compile
unchanged. Local gate green: cargo check-all (57 Rust tests),
pytest (112 Python tests).
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Addresses Luca's P2-4 review feedback on PR #74: the construction-only TLS regression test verifies URL parsing but not the lazy handshake. Added test_rediss_url_lazy_handshake_reaches_connectivity_layer that drives a tool_pre_invoke against rediss://127.0.0.1:1/15 so the lazy TLS path actually engages, then asserts the failure is connectivity- shaped (refused / timed out / IO) rather than the wo-tracker #68217 'feature is not enabled' / InvalidClientConfig signature. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…me path
Two related changes that together make `rediss://` URLs actually work
end-to-end:
1. Install the rustls ring crypto provider once at Rust core init.
Previous TLS commits made `rediss://` URLs parse and the plugin
load without the wo-tracker #68217 "feature is not enabled" error,
but they left a runtime-time gap: rustls 0.23 dropped its implicit
default crypto provider, so the first real TLS handshake panicked
with `Call CryptoProvider::install_default()...`.
Added a direct dep on rustls 0.23 with the `ring` feature, plus an
OnceLock-guarded `ensure_crypto_provider()` in the Rust core that
calls `rustls::crypto::ring::default_provider().install_default()`
exactly once per process. Called from `RateLimiterPluginCore::new()`
before any redis client is constructed.
`ring` is the simpler default — pure Rust + small asm, BSD-3 / ISC
licensed, builds cleanly across the existing PyPI wheel matrix
without extra system toolchains. aws-lc-rs would be needed for
FIPS validation; not a requirement for this customer.
2. Drop the `tls-rustls-webpki-roots` feature on the redis crate.
That feature pulled `webpki-roots` (Mozilla's CA bundle, licensed
CDLA-Permissive-2.0) into the dep tree, which the workspace
deny.toml correctly rejected because that license is not in the
allow list. Adding the license to the allow list (or as an
exception) widens the workspace policy for one transitive crate.
Without `tls-rustls-webpki-roots`, the redis crate uses
`rustls-native-certs` (Apache-2.0 / MIT / ISC — already allowed)
to read CA certificates from the host OS trust store at runtime.
mcp-context-forge ships on UBI 10 Minimal, which includes
`ca-certificates` by default; AWS ElastiCache uses public CA
chains that are in any standard OS bundle. Operationally
equivalent for this deployment, with a clean license posture.
Reverts the `CDLA-Permissive-2.0` entry from the workspace
deny.toml allow list — no longer needed.
Test: the existing TLS lazy-handshake regression test
(test_rediss_url_lazy_handshake_reaches_connectivity_layer) now reaches
the network layer instead of panicking at TLS init. Its assertion is
scoped to the wo-tracker #68217 negative signature ("feature is not
enabled" / InvalidClientConfig must never appear). A positive "the lazy
handshake actually reached the network" assertion would require the
Rust core's log_exception() to surface the underlying RedisError text
instead of the generic "error; allowing request" message — tracked as
a follow-up alongside the real-TLS-Redis-fixture variant.
Local gate: cargo fmt-check + clippy clean, 57 Rust unit tests pass,
107 Python integration tests pass.
Cargo.lock confirms removal of unmaintained `rustls-pemfile` and
license-flagged `webpki-roots`; `rustls-native-certs` 0.8.3 is the only
remaining TLS-related dep.
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Main bumped rate_limiter to 0.0.5 in #73 as part of a workspace-wide dependency refresh, so this PR's release slot moves to 0.0.6. The content of 0.0.6 is the TLS / rediss:// support work in this PR (crypto provider install, redis crate bump for advisory cleanup, TLS regression tests). Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
bdc0740 to
1969f17
Compare
…unit test
cargo-mutants flagged an uncovered mutation: replacing
ensure_crypto_provider's body with () passed the existing test suite,
because nothing in the Rust unit tests directly exercises the crypto
provider install (TLS coverage lives on the Python integration side).
Added a unit test in plugin.rs's tests module that calls
ensure_crypto_provider() and asserts
rustls::crypto::CryptoProvider::get_default().is_some().
- Real implementation installs the ring provider via OnceLock
→ get_default() returns Some → test passes.
- Mutated implementation does nothing → no provider installed
→ get_default() returns None → test fails → mutant killed.
OnceLock makes the call idempotent, so the test is order-independent
within a cargo-test process. No behavioral change; pure coverage.
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Summary
Enables TLS for the rate limiter's Redis backend so deployments against managed Redis services with in-transit encryption (AWS ElastiCache, Redis Cloud, etc.) work end-to-end. Bumps
cpex_rate_limiter0.0.4 → 0.0.6.Gaps closed
Gap A (HIGH, wo-tracker #68217) — The Redis backend cannot connect to managed Redis services that require TLS. Operators using AWS ElastiCache with in-transit encryption (or any other
rediss://URL) can configure the binding successfully — the binding API returns HTTP 200 — but the plugin's backend init raisesInvalidClientConfig: can't connect with TLS, the feature is not enabled. The plugin manager logsFailed to load plugin RateLimiterPluginand silently skips it under the defaultfail_on_plugin_error: falsesetting, leaving rate limits unenforced while the operator believes them to be in place. Fixed by compiling the redis crate with rustls-based TLS features.Gap B (MEDIUM) — The redis crate version previously pinned (0.27) carried
rustls-pemfile2.2.0 transitively, whichcargo-denyflagged as unmaintained (RUSTSEC-2025-0134) and blocked CI'ssecurity-policystage. Fixed by bumping the redis crate to 0.32, where the rustls-native-certs upgrade dropped therustls-pemfiledep entirely.Gap C (MEDIUM) — rustls 0.23 dropped its implicit default crypto provider; the redis crate's
tls-rustlsfeature does not pick one. Without an explicit install, the first TLS handshake panicked withCall CryptoProvider::install_default() before this point.... The Python shim's broad except caught the panic andfail_mode=openallowed the request through, so deployments would still appear to load the plugin while quietly enforcing nothing — same observable end state as Gap A, different log message. Fixed by installing theringcrypto provider once at plugin init via aOnceLock-guarded call inRateLimiterPluginCore::new().Test results
Full local gate
New tests added in this PR
TestRedisTlsSupport— regression coverage for wo-tracker #68217test_rediss_url_does_not_fail_with_tls_not_enabledrediss://URL; asserts noInvalidClientConfigis raised at backend init. Pins Gap A.test_rediss_url_lazy_handshake_reaches_connectivity_layertool_pre_invokeagainst a closedrediss://port so the lazy TLS handshake actually engages; asserts the captured warning does not contain the wo-tracker #68217 signature (feature is not enabled/InvalidClientConfig). Pins Gap A and Gap C jointly.Migration notes
redis://URLs continue to work unchanged.rediss://URLs now work where they previously failed.rustls-native-certs. Container images that omitca-certificates(some distroless / scratch / unconfigured Alpine variants) would need to install the bundle before TLS endpoints can be verified. UBI Minimal — what mcp-context-forge ships on — already includes the CA bundle by default; AWS ElastiCache uses public-CA chains that are present there.Follow-ups
log_exception()to surface the underlyingRedisErrortext instead of the generic "error; allowing request" message — the error details are dropped at the Python boundary today.cpex-rate-limiterpin to 0.0.5 once published to PyPI.mode=disabled) and the round-1/round-2 review feedback addressing it have been moved to a separate branch (feat/rate-limiter-wipe-on-disable-only); a follow-up PR will be linked here once opened.