
fix(rate-limiter): production hardening — Redis connection path #78

Merged
lucarlig merged 4 commits into main from fix/rate-limiter-redis-connect-timeout on May 5, 2026

Conversation

@gandhipratik203
Collaborator

Summary

Wraps Client::get_multiplexed_tokio_connection().await in a 2-second tokio::time::timeout so the rate-limiter fails fast when the configured Redis endpoint accepts TCP but never speaks at the application layer (plain redis:// against a TLS-required server, network ACL drops, firewalled cluster, etc.). Without the bound, the call blocks indefinitely; the framework's outer 30-second plugin timeout eventually kills the hook, and fail_mode cannot engage because the connection-acquisition future never returns to surface an error.

The timeout error is mapped into a redis::ErrorKind::IoError-shaped RedisError so the existing fail_mode path routes it the same way as any other connection-side failure.
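The shape of the fix, sketched here in Python for brevity (the actual change is Rust; `acquire_connection` and the error message are illustrative stand-ins, with `asyncio.wait_for` playing the role of `tokio::time::timeout`):

```python
import asyncio

CONNECT_TIMEOUT = 2.0  # seconds; mirrors the hardcoded Rust CONNECT_TIMEOUT

async def acquire_connection(host: str, port: int):
    """Open a connection and wait for the server's first byte, bounded by
    CONNECT_TIMEOUT. A server that accepts TCP but never speaks hangs in
    the read, and the bound converts that hang into a fast error."""
    async def handshake():
        reader, writer = await asyncio.open_connection(host, port)
        await reader.read(1)  # stands in for the RESP exchange
        return reader, writer

    try:
        return await asyncio.wait_for(handshake(), timeout=CONNECT_TIMEOUT)
    except asyncio.TimeoutError as exc:
        # Map the elapsed error into the same category as any other
        # connection-side I/O failure so fail_mode routing stays uniform.
        raise ConnectionError("connection timeout") from exc
```

The key property is that the elapsed case surfaces as an ordinary connection error, so no caller needs a new branch for "timed out while acquiring".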

Test coverage

Two layers — both fail-by-hang against the prior implementation, both pass against this commit:

  • Rust unit test (redis_backend::tests::connection_async_fails_fast_against_hanging_redis) — binds a TCP listener that accepts but never reads/writes; asserts connection_async returns within ~3s with an IoError-shaped error.
  • Python integration test (TestRedisFailModeAndViolationContext::test_hanging_redis_fails_fast_via_connect_timeout) — same setup pattern at the public-API layer; asserts tool_pre_invoke completes within ~5s and the default fail_mode=open allows the request through.

Each test is wrapped with an outer runaway-guard (tokio::time::timeout(5s) / asyncio.wait_for(..., 10.0)) so a regression doesn't hang the test run.
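The test pattern, again sketched in Python (the real tests are the Rust unit test and the pytest case named above; server setup and names here are illustrative): bind a listener that accepts and then stays silent, and assert the bounded connect errors out well inside the hang.

```python
import asyncio
import time

CONNECT_TIMEOUT = 2.0  # stand-in for the Rust constant

async def bounded_connect(host: str, port: int):
    """Connect and wait for the server's first byte, bounded by CONNECT_TIMEOUT."""
    async def handshake():
        reader, writer = await asyncio.open_connection(host, port)
        await reader.read(1)  # a silent server never satisfies this
        return reader, writer
    return await asyncio.wait_for(handshake(), timeout=CONNECT_TIMEOUT)

async def main() -> float:
    # Hanging endpoint: accepts the TCP handshake, then never reads or writes.
    async def silent(reader, writer):
        await asyncio.sleep(3600)
    server = await asyncio.start_server(silent, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    start = time.monotonic()
    try:
        # Outer runaway-guard, like the tests' 5 s / 10 s wrappers.
        await asyncio.wait_for(bounded_connect("127.0.0.1", port), timeout=5.0)
        raise AssertionError("expected the bounded connect to time out")
    except asyncio.TimeoutError:
        pass
    finally:
        server.close()
        await server.wait_closed()
    return time.monotonic() - start

elapsed = asyncio.run(main())
assert elapsed < 3.0, f"took {elapsed:.1f}s; expected fail-fast"
```

Against an implementation without the inner bound, the same script would hang until the outer 5 s guard fires, which is exactly the fail-by-hang behaviour the new tests pin.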

Full local gate
cargo fmt -- --check                  clean
cargo clippy -- -D warnings           clean
cargo test --release -p rate_limiter  59 passed
make test-integration                 60 passed

Configurability

The 2-second CONNECT_TIMEOUT is a hardcoded constant for now to keep the change small. Promoting it to a redis_connect_timeout_ms config key (extending the existing config-key list and warning machinery) is a trivial follow-up if operators in slow-network deployments need a longer budget.
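If the knob is promoted, the lookup itself is small — the key name, default, and validation below are the follow-up's proposed shape, not shipped code (sketched in Python; the real machinery is the plugin's Rust config validation):

```python
DEFAULT_CONNECT_TIMEOUT_MS = 2000  # today's hardcoded 2 s

def connect_timeout_seconds(config: dict) -> float:
    """Hypothetical redis_connect_timeout_ms lookup with the 2 s default."""
    raw = config.get("redis_connect_timeout_ms", DEFAULT_CONNECT_TIMEOUT_MS)
    ms = int(raw)
    if ms <= 0:
        raise ValueError("redis_connect_timeout_ms must be a positive integer")
    return ms / 1000.0
```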

@gandhipratik203 gandhipratik203 changed the title from "fix(rate-limiter): bound Redis connection acquisition with a 2s timeout" to "fix(rate-limiter): production hardening — Redis connection path" on May 5, 2026
Without a time-bound on ``Client::get_multiplexed_tokio_connection().await``,
the rate-limiter blocks indefinitely against any Redis endpoint where the
TCP handshake completes but the application layer never speaks
(e.g. plain ``redis://`` against a TLS-required server, network ACL drops,
firewalled cluster, etc.).  Every request through the plugin then pays
the framework's outer 30-second plugin timeout, and ``fail_mode`` cannot
engage because the connection-acquisition future never returns to surface
an error.

This commit wraps the connection acquisition in
``tokio::time::timeout(Duration::from_secs(2), …)`` and maps the elapsed
error into a ``redis::ErrorKind::IoError``-shaped ``RedisError`` so the
existing ``fail_mode`` path routes it the same way as any other
connection-side failure.

Test coverage added at both layers:

  * Rust unit test (``redis_backend::tests::connection_async_fails_fast_against_hanging_redis``)
    — binds a TCP listener that accepts but never reads/writes; asserts
    ``connection_async`` returns within ~3s with an ``IoError``-shaped
    error.

  * Python integration test
    (``TestRedisFailModeAndViolationContext::test_hanging_redis_fails_fast_via_connect_timeout``)
    — same setup pattern at the public-API layer; asserts
    ``tool_pre_invoke`` completes within ~5s and the default
    ``fail_mode=open`` allows the request through.

Both tests fail-by-hang against the prior implementation and pass against
this commit; an outer ``tokio::time::timeout`` / ``asyncio.wait_for``
guards each so a regression doesn't hang the test run.

The 2-second value is hardcoded as ``CONNECT_TIMEOUT`` for now to keep
the change small.  Promoting it to a ``redis_connect_timeout_ms`` config
key (extending the existing config-key list and warning machinery) is a
trivial follow-up if operators in slow-network deployments need a longer
budget.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Companion bump to the connection-acquisition timeout fix on this
branch.  0.0.6 was the TLS-support release (cpex-plugins#74); 0.0.7
ships the timeout fix on top of it.

No behavioural change in this commit on its own — version bump only.
Cargo.lock regenerated via ``cargo update -p rate_limiter``.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/rate-limiter-redis-connect-timeout branch from 0cb67dd to 0e9bbeb on May 5, 2026 08:24
@lucarlig
Collaborator

lucarlig commented May 5, 2026

Rechecked current head 0e9bbeb after the update. Thanks for the version bump; that part is fixed now.

Still missing from the previous review:

  1. test_hanging_redis_fails_fast_via_connect_timeout still only covers default fail_mode=open. Please add the same hanging TCP-accepted/app-silent path for fail_mode=closed and assert the BACKEND_UNAVAILABLE response shape (continue_processing=false, 503, Retry-After).

  2. The Redis connect timeout is still a hard-coded Duration::from_secs(2) inside connection_async. The earlier concern remains: this is operator policy and can flip slow managed/TLS Redis paths into fail-open/fail-closed without a tuning knob. Please move this through config/defaults/unknown-key handling, or explicitly justify why this PR should ship with a fixed constant.

  3. The Rust unit test still accepts ResponseError even though the implementation maps the explicit timeout to IoError with connection timeout context. Please tighten the assertion to the exact timeout contract.

I did not rerun the test suite; this is a diff re-check only.

…osed hanging-Redis test

Addresses two review comments on PR #78:

1. Rust unit test ``connection_async_fails_fast_against_hanging_redis``
   previously accepted both ``IoError`` and ``ResponseError`` for the
   timeout error.  The implementation explicitly maps the elapsed-error
   into ``redis::ErrorKind::IoError``, so the test now pins exactly that
   variant (``assert_eq!(err.kind(), IoError)``).  Anything else would
   mean the timeout is being routed through a different code path than
   the rest of the fail-mode logic.

2. Add ``test_hanging_redis_with_fail_mode_closed_blocks_with_backend_unavailable``
   as a sibling to the existing ``fail_mode=open`` test.  Same
   hanging-socket setup; flips ``fail_mode`` to ``closed`` and asserts
   the documented BACKEND_UNAVAILABLE response envelope:
   ``continue_processing=False``, ``violation.code='BACKEND_UNAVAILABLE'``,
   ``http_status_code=503``, and ``Retry-After`` present in
   ``violation.http_headers``.  The pair now pin both halves of the
   operator's policy contract under the hanging-Redis failure shape.

Out of scope of this commit: promoting the hardcoded
``CONNECT_TIMEOUT`` to a config key — addressed in a separate
follow-up commit on this branch.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds a second comment block alongside the existing "why bound this
at all" rationale, explaining why the value is a hardcoded constant
rather than a config key — keep the plugin's config surface small,
2 s covers typical production paths, and promoting to a knob is a
trivial follow-up if a slow-network deployment ever surfaces.

Captures the rationale in code so it's discoverable on next review
of this constant, rather than only in PR-description history.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203
Collaborator Author

Thanks for the re-check.

On (2) — kept hardcoded, with rationale captured inline at redis_backend.rs so it's discoverable on next read.

To state the trade-off honestly: 2s is on the aggressive side compared to Lettuce/ioredis (10s) or .NET StackExchange (5s), and you're right that it's operator policy in principle. The reason we'd like to leave it as a constant for this PR: the plugin's config surface is small by design, and adding a knob that the vast majority of operators won't tune expands the schema for everyone else. 2 seconds covers typical production paths comfortably — intra-VPC and cross-AZ Redis connection establishment is well under 100ms, and managed Redis with TLS handshake adds only ~100-300ms on top, so the "slow managed/TLS Redis flip" requires a genuinely unusual ~1.5+ second handshake to bite.

If a deployment with deliberately slow networks surfaces, promoting this to redis_connect_timeout_ms is a small, well-scoped change — the engine's config-validation machinery (KNOWN list, lib.rs defaults, unknown-key warning) is already there. Happy to file it as a separate follow-up issue if you'd prefer to track it.

Let me know if you'd rather we add the knob now anyway — defensible either way and your call.

@gandhipratik203
Collaborator Author

On (1) and (3) — both addressed in bc0267b:

  • (1) Added test_hanging_redis_with_fail_mode_closed_blocks_with_backend_unavailable as a sibling to the existing fail_mode=open hanging-Redis test. Same socket setup; flips fail_mode to closed and asserts continue_processing=False, violation.code='BACKEND_UNAVAILABLE', violation.http_status_code=503, and Retry-After present in violation.http_headers. The pair now pin both halves of the operator's policy under the hanging-Redis failure shape.

  • (3) Tightened the Rust unit test from matches!(kind, IoError | ResponseError) to assert_eq!(err.kind(), IoError). Pins the exact contract that the timeout maps into the same fail-mode-routed error category as any other connection-side failure.
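The closed-mode envelope described in (1) can be pinned with assertions of roughly this shape (field names follow the comment above; the result dict and helper are hypothetical stand-ins for the real pytest fixtures):

```python
def assert_backend_unavailable(result: dict) -> None:
    """Pin the documented BACKEND_UNAVAILABLE envelope for fail_mode=closed."""
    assert result["continue_processing"] is False
    violation = result["violation"]
    assert violation["code"] == "BACKEND_UNAVAILABLE"
    assert violation["http_status_code"] == 503
    assert "Retry-After" in violation["http_headers"]

# A response satisfying the contract:
assert_backend_unavailable({
    "continue_processing": False,
    "violation": {
        "code": "BACKEND_UNAVAILABLE",
        "http_status_code": 503,
        "http_headers": {"Retry-After": "1"},
    },
})
```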


@lucarlig lucarlig left a comment


LGTM

@lucarlig lucarlig merged commit dce4d53 into main May 5, 2026
45 checks passed
@lucarlig lucarlig deleted the fix/rate-limiter-redis-connect-timeout branch May 5, 2026 11:15