
fix(rate-limiter): production hardening — Redis connection path #78

Merged
lucarlig merged 4 commits into main from fix/rate-limiter-redis-connect-timeout on May 5, 2026

Conversation

@gandhipratik203
Collaborator

Summary

Wraps Client::get_multiplexed_tokio_connection().await in a 2-second tokio::time::timeout so the rate-limiter fails fast when the configured Redis endpoint accepts TCP but never speaks at the application layer (plain redis:// against a TLS-required server, network ACL drops, firewalled cluster, etc.). Without the bound, the call blocks indefinitely; the framework's outer 30-second plugin timeout eventually kills the hook, and fail_mode cannot engage because the connection-acquisition future never returns to surface an error.

The timeout error is mapped into a redis::ErrorKind::IoError-shaped RedisError so the existing fail_mode path routes it the same way as any other connection-side failure.
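The shape of the fix, sketched here in Python for brevity (the actual change is Rust; `acquire_connection` and the error message are illustrative stand-ins, with `asyncio.wait_for` playing the role of `tokio::time::timeout`):

```python
import asyncio

CONNECT_TIMEOUT = 2.0  # seconds; mirrors the hardcoded Rust CONNECT_TIMEOUT

async def acquire_connection(host: str, port: int):
    """Open a connection and wait for the server's first byte, bounded by
    CONNECT_TIMEOUT. A server that accepts TCP but never speaks hangs in
    the read, and the bound converts that hang into a fast error."""
    async def handshake():
        reader, writer = await asyncio.open_connection(host, port)
        await reader.read(1)  # stands in for the RESP exchange
        return reader, writer

    try:
        return await asyncio.wait_for(handshake(), timeout=CONNECT_TIMEOUT)
    except asyncio.TimeoutError as exc:
        # Map the elapsed error into the same category as any other
        # connection-side I/O failure so fail_mode routing stays uniform.
        raise ConnectionError("connection timeout") from exc
```

The key property is that the elapsed case surfaces as an ordinary connection error, so no caller needs a new branch for "timed out while acquiring".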

Test coverage

Two layers — both fail-by-hang against the prior implementation, both pass against this commit:

  • Rust unit test (redis_backend::tests::connection_async_fails_fast_against_hanging_redis) — binds a TCP listener that accepts but never reads/writes; asserts connection_async returns within ~3s with an IoError-shaped error.
  • Python integration test (TestRedisFailModeAndViolationContext::test_hanging_redis_fails_fast_via_connect_timeout) — same setup pattern at the public-API layer; asserts tool_pre_invoke completes within ~5s and the default fail_mode=open allows the request through.

Each test is wrapped with an outer runaway-guard (tokio::time::timeout(5s) / asyncio.wait_for(..., 10.0)) so a regression doesn't hang the test run.
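The test pattern, again sketched in Python (the real tests are the Rust unit test and the pytest case named above; server setup and names here are illustrative): bind a listener that accepts and then stays silent, and assert the bounded connect errors out well inside the hang.

```python
import asyncio
import time

CONNECT_TIMEOUT = 2.0  # stand-in for the Rust constant

async def bounded_connect(host: str, port: int):
    """Connect and wait for the server's first byte, bounded by CONNECT_TIMEOUT."""
    async def handshake():
        reader, writer = await asyncio.open_connection(host, port)
        await reader.read(1)  # a silent server never satisfies this
        return reader, writer
    return await asyncio.wait_for(handshake(), timeout=CONNECT_TIMEOUT)

async def main() -> float:
    # Hanging endpoint: accepts the TCP handshake, then never reads or writes.
    async def silent(reader, writer):
        await asyncio.sleep(3600)
    server = await asyncio.start_server(silent, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    start = time.monotonic()
    try:
        # Outer runaway-guard, like the tests' 5 s / 10 s wrappers.
        await asyncio.wait_for(bounded_connect("127.0.0.1", port), timeout=5.0)
        raise AssertionError("expected the bounded connect to time out")
    except asyncio.TimeoutError:
        pass
    finally:
        server.close()
        await server.wait_closed()
    return time.monotonic() - start

elapsed = asyncio.run(main())
assert elapsed < 3.0, f"took {elapsed:.1f}s; expected fail-fast"
```

Against an implementation without the inner bound, the same script would hang until the outer 5 s guard fires, which is exactly the fail-by-hang behaviour the new tests pin.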

Full local gate
cargo fmt -- --check                  clean
cargo clippy -- -D warnings           clean
cargo test --release -p rate_limiter  59 passed
make test-integration                 60 passed

Configurability

The 2-second CONNECT_TIMEOUT is a hardcoded constant for now to keep the change small. Promoting it to a redis_connect_timeout_ms config key (extending the existing config-key list and warning machinery) is a trivial follow-up if operators in slow-network deployments need a longer budget.
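If the knob is promoted, the lookup itself is small — the key name, default, and validation below are the follow-up's proposed shape, not shipped code (sketched in Python; the real machinery is the plugin's Rust config validation):

```python
DEFAULT_CONNECT_TIMEOUT_MS = 2000  # today's hardcoded 2 s

def connect_timeout_seconds(config: dict) -> float:
    """Hypothetical redis_connect_timeout_ms lookup with the 2 s default."""
    raw = config.get("redis_connect_timeout_ms", DEFAULT_CONNECT_TIMEOUT_MS)
    ms = int(raw)
    if ms <= 0:
        raise ValueError("redis_connect_timeout_ms must be a positive integer")
    return ms / 1000.0
```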

@gandhipratik203 gandhipratik203 changed the title from "fix(rate-limiter): bound Redis connection acquisition with a 2s timeout" to "fix(rate-limiter): production hardening — Redis connection path" on May 5, 2026
Without a time-bound on ``Client::get_multiplexed_tokio_connection().await``,
the rate-limiter blocks indefinitely against any Redis endpoint where the
TCP handshake completes but the application layer never speaks
(e.g. plain ``redis://`` against a TLS-required server, network ACL drops,
firewalled cluster, etc.).  Every request through the plugin then pays
the framework's outer 30-second plugin timeout, and ``fail_mode`` cannot
engage because the connection-acquisition future never returns to surface
an error.

This commit wraps the connection acquisition in
``tokio::time::timeout(Duration::from_secs(2), …)`` and maps the elapsed
error into a ``redis::ErrorKind::IoError``-shaped ``RedisError`` so the
existing ``fail_mode`` path routes it the same way as any other
connection-side failure.

Test coverage added at both layers:

  * Rust unit test (``redis_backend::tests::connection_async_fails_fast_against_hanging_redis``)
    — binds a TCP listener that accepts but never reads/writes; asserts
    ``connection_async`` returns within ~3s with an ``IoError``-shaped
    error.

  * Python integration test
    (``TestRedisFailModeAndViolationContext::test_hanging_redis_fails_fast_via_connect_timeout``)
    — same setup pattern at the public-API layer; asserts
    ``tool_pre_invoke`` completes within ~5s and the default
    ``fail_mode=open`` allows the request through.

Both tests fail-by-hang against the prior implementation and pass against
this commit; an outer ``tokio::time::timeout`` / ``asyncio.wait_for``
guards each so a regression doesn't hang the test run.

The 2-second value is hardcoded as ``CONNECT_TIMEOUT`` for now to keep
the change small.  Promoting it to a ``redis_connect_timeout_ms`` config
key (extending the existing config-key list and warning machinery) is a
trivial follow-up if operators in slow-network deployments need a longer
budget.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Companion bump to the connection-acquisition timeout fix on this
branch.  0.0.6 was the TLS-support release (cpex-plugins#74); 0.0.7
ships the timeout fix on top of it.

No behavioural change in this commit on its own — version bump only.
Cargo.lock regenerated via ``cargo update -p rate_limiter``.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/rate-limiter-redis-connect-timeout branch from 0cb67dd to 0e9bbeb on May 5, 2026 08:24
@lucarlig
Collaborator

lucarlig commented May 5, 2026

Rechecked current head 0e9bbeb after the update. Thanks for the version bump; that part is fixed now.

Still missing from the previous review:

  1. test_hanging_redis_fails_fast_via_connect_timeout still only covers default fail_mode=open. Please add the same hanging TCP-accepted/app-silent path for fail_mode=closed and assert the BACKEND_UNAVAILABLE response shape (continue_processing=false, 503, Retry-After).

  2. The Redis connect timeout is still a hard-coded Duration::from_secs(2) inside connection_async. The earlier concern remains: this is operator policy and can flip slow managed/TLS Redis paths into fail-open/fail-closed without a tuning knob. Please move this through config/defaults/unknown-key handling, or explicitly justify why this PR should ship with a fixed constant.

  3. The Rust unit test still accepts ResponseError even though the implementation maps the explicit timeout to IoError with connection timeout context. Please tighten the assertion to the exact timeout contract.

I did not rerun the test suite; this is a diff re-check only.

…osed hanging-Redis test

Addresses two review comments on PR #78:

1. Rust unit test ``connection_async_fails_fast_against_hanging_redis``
   previously accepted both ``IoError`` and ``ResponseError`` for the
   timeout error.  The implementation explicitly maps the elapsed-error
   into ``redis::ErrorKind::IoError``, so the test now pins exactly that
   variant (``assert_eq!(err.kind(), IoError)``).  Anything else would
   mean the timeout is being routed through a different code path than
   the rest of the fail-mode logic.

2. Add ``test_hanging_redis_with_fail_mode_closed_blocks_with_backend_unavailable``
   as a sibling to the existing ``fail_mode=open`` test.  Same
   hanging-socket setup; flips ``fail_mode`` to ``closed`` and asserts
   the documented BACKEND_UNAVAILABLE response envelope:
   ``continue_processing=False``, ``violation.code='BACKEND_UNAVAILABLE'``,
   ``http_status_code=503``, and ``Retry-After`` present in
   ``violation.http_headers``.  The pair now pin both halves of the
   operator's policy contract under the hanging-Redis failure shape.

Out of scope of this commit: promoting the hardcoded
``CONNECT_TIMEOUT`` to a config key — addressed in a separate
follow-up commit on this branch.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Adds a second comment block alongside the existing "why bound this
at all" rationale, explaining why the value is a hardcoded constant
rather than a config key — keep the plugin's config surface small,
2 s covers typical production paths, and promoting to a knob is a
trivial follow-up if a slow-network deployment ever surfaces.

Captures the rationale in code so it's discoverable on next review
of this constant, rather than only in PR-description history.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203
Collaborator Author

Thanks for the re-check.

On (2) — kept hardcoded, with rationale captured inline at redis_backend.rs so it's discoverable on next read.

To state the trade-off honestly: 2s is on the aggressive side compared to Lettuce/ioredis (10s) or .NET StackExchange (5s), and you're right that it's operator policy in principle. The reason we'd like to leave it as a constant for this PR: the plugin's config surface is small by design, and adding a knob that the vast majority of operators won't tune expands the schema for everyone else. 2 seconds covers typical production paths comfortably — intra-VPC and cross-AZ Redis connection establishment is well under 100ms, and managed Redis with TLS handshake adds only ~100-300ms on top, so the "slow managed/TLS Redis flip" requires a genuinely unusual ~1.5+ second handshake to bite.

If a deployment with deliberately slow networks surfaces, promoting this to redis_connect_timeout_ms is a small, well-scoped change — the engine's config-validation machinery (KNOWN list, lib.rs defaults, unknown-key warning) is already there. Happy to file it as a separate follow-up issue if you'd prefer to track it.

Let me know if you'd rather we add the knob now anyway — defensible either way and your call.

@gandhipratik203
Collaborator Author

On (1) and (3) — both addressed in bc0267b:

  • (1) Added test_hanging_redis_with_fail_mode_closed_blocks_with_backend_unavailable as a sibling to the existing fail_mode=open hanging-Redis test. Same socket setup; flips fail_mode to closed and asserts continue_processing=False, violation.code='BACKEND_UNAVAILABLE', violation.http_status_code=503, and Retry-After present in violation.http_headers. The pair now pin both halves of the operator's policy under the hanging-Redis failure shape.

  • (3) Tightened the Rust unit test from matches!(kind, IoError | ResponseError) to assert_eq!(err.kind(), IoError). Pins the exact contract that the timeout maps into the same fail-mode-routed error category as any other connection-side failure.
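The closed-mode envelope described in (1) can be pinned with assertions of roughly this shape (field names follow the comment above; the result dict and helper are hypothetical stand-ins for the real pytest fixtures):

```python
def assert_backend_unavailable(result: dict) -> None:
    """Pin the documented BACKEND_UNAVAILABLE envelope for fail_mode=closed."""
    assert result["continue_processing"] is False
    violation = result["violation"]
    assert violation["code"] == "BACKEND_UNAVAILABLE"
    assert violation["http_status_code"] == 503
    assert "Retry-After" in violation["http_headers"]

# A response satisfying the contract:
assert_backend_unavailable({
    "continue_processing": False,
    "violation": {
        "code": "BACKEND_UNAVAILABLE",
        "http_status_code": 503,
        "http_headers": {"Retry-After": "1"},
    },
})
```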


@lucarlig lucarlig left a comment


LGTM

@lucarlig lucarlig merged commit dce4d53 into main May 5, 2026
45 checks passed
@lucarlig lucarlig deleted the fix/rate-limiter-redis-connect-timeout branch May 5, 2026 11:15